
ena NIC msi_irq is empty on ubuntu kernel 5.19 - causing a severe throughput degradation for i3.4xlarge (and above?) #13560

Closed
aleksbykov opened this issue Apr 18, 2023 · 63 comments
Labels
P1 Urgent status/regression status/release blocker Preventing from a release to be promoted symptom/performance Issues causing performance problems

@aleksbykov
Contributor

Installation details
Scylla version (or git commit hash): 5.3.0~dev-0.20230415.1da02706ddb8 with build-id f7ac5cd90e63ace5065c583d6d1d9c381f39b5c2
Cluster size: 3
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-0501eb17c8c79b6d2

Performance latency test jobs:

The performance latency 1TB test's prepare step populates a dataset of size 1TB with the following c-s commands:

["cassandra-stress write no-warmup cl=ALL n=250000000 -schema 'replication(factor=3)' -mode cql3 native -rate threads=200 -col 'size=FIXED(1024) n=FIXED(1)' -pop seq=1..250000000",
 "cassandra-stress write no-warmup cl=ALL n=250000000 -schema 'replication(factor=3)' -mode cql3 native -rate threads=200 -col 'size=FIXED(1024) n=FIXED(1)' -pop seq=250000001..500000000",
"cassandra-stress write no-warmup cl=ALL n=250000000 -schema 'replication(factor=3)' -mode cql3 native -rate threads=200 -col 'size=FIXED(1024) n=FIXED(1)' -pop seq=500000001..750000000",
 "cassandra-stress write no-warmup cl=ALL n=250000000 -schema 'replication(factor=3)' -mode cql3 native -rate threads=200 -col 'size=FIXED(1024) n=FIXED(1)' -pop seq=750000001..1000000000"]

All cassandra-stress commands run with cl=ALL.
About 30 minutes after the prepare stress commands started, all nodes began reporting many reactor stalls of about 6-20 ms, along with lsa-time records in the log. Decoded reactor stalls attached: reactor_stalls_decoded_nodes.zip
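
(For reference, a minimal sketch of how such a decode is produced, assuming seastar's stall-analyser script; the flags are copied from the decoder's own "Command line options" line quoted later in this thread, and the paths are illustrative:)

# Feed the node's log and the matching scylla executable (for symbol
# resolution) to the stall analyser; bottom-up prints callees first.
./seastar/scripts/stall-analyser.py \
    --address-threshold=0x100000000 \
    --executable=scylla/libexec/scylla \
    --direction=bottom-up \
    --branch-threshold=0.05 \
    --file=reactor_stalls_node1.log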

on monitoring: http://3.237.101.163:3000/d/sZoKwKP4k/scylla-enterprise-perf-regression-latency-shard-aware-1tb-test-scylla-per-server-metrics-nemesis-master?orgId=1&from=1681736652253&to=1681773381114
we see the following problems for compactions:
Screenshot from 2023-04-18 16-43-46
vs. the latest successful run, for Scylla 5.3.0~dev-0.20230316.5705df77a155:
Screenshot from 2023-04-18 16-47-37

and the following for 'Writes currently blocked on dirty':
Screenshot from 2023-04-18 16-44-15
vs. the latest successful run, for Scylla 5.3.0~dev-0.20230316.5705df77a155:
Screenshot from 2023-04-18 16-48-52

Latest successful run is for scylla: 5.3.0~dev-0.20230316.5705df77a155
job: https://jenkins.scylladb.com/view/New%20Performance%20Jobs/job/scylla-master/job/scylla-master-perf-regression-latency-shard-aware-1TB-test/14

The issue started appearing from 5.3.0~dev-0.20230331.160c184d0b0f with build-id 13d781a3205d092514f6642c9787a566aba7c110.
Latest successful run: 5.3.0~dev-0.20230316.5705df77a155

Possible Scylla commit that brought the issue: 472b155

DB logs: https://cloudius-jenkins-test.s3.amazonaws.com/540d1d32-9700-4c1c-aecc-168260285dd3/20230417_225030/db-cluster-540d1d32.tar.gz

@mykaul mykaul added symptom/performance Issues causing performance problems status/regression labels Apr 18, 2023
@roydahan

@raphaelsc / @bhalevy my suspicion is that it's something related to compactions; it looks like the compactions are not "steady" during the entire write, and the effect is severe.
Was there any change in master over the last month that may explain this?

@aleksbykov let's try to bisect it to a smaller range.
We already know the last successful run and the first one that failed (not this run).
Let's find a build between them and re-run with it.

@bhalevy
Member

bhalevy commented Apr 18, 2023

all nodes began reporting many reactor stalls of about 6-20 ms, along with lsa-time records in the log. Decoded reactor stalls attached: reactor_stalls_decoded_nodes.zip

The reactor stalls are relatively short. For example, on node1:

Processed 2486 stalls lasting a total of 15126 milliseconds.
min=6 avg=6.1 median=6 p95=7 p99=8 p999=9 max=13

This graph is printed in bottom-up order, where callees are printed first.
Use --direction=top-down to print callers first.

[level#index/out_of pct%] below denotes:
  level  - nesting level in the graph
  index  - index of node among to its siblings
  out_of - number of siblings
  pct    - percentage of total stall time of this call relative to its siblings

Command line options: --address-threshold=0x100000000 --executable=/home/abykov/tmp/performance/540d1d32-9700-4c1c-aecc-168260285dd3/decode_reactor_stalls/scylla/libexec/scylla --direction=bottom-up --branch-threshold=0.05 --file=../reactor_stalls_node1.log

++[0#1/1 100%] addr=0x542c633 total=15126 count=2486 avg=6:
|              seastar::backtrace<seastar::backtrace_buffer::append_backtrace_oneline()::{lambda(seastar::frame)#1}> at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:60
|              (inlined by) seastar::backtrace_buffer::append_backtrace_oneline at ./build/release/seastar/./seastar/src/core/reactor.cc:797
|              (inlined by) seastar::print_with_backtrace at ./build/release/seastar/./seastar/src/core/reactor.cc:816
++           - addr=0x542b9e0:
|              seastar::internal::cpu_stall_detector::generate_trace at ./build/release/seastar/./seastar/src/core/reactor.cc:1386
++           - addr=0x542ce0f:
|              seastar::internal::cpu_stall_detector::maybe_report at ./build/release/seastar/./seastar/src/core/reactor.cc:1123
|              (inlined by) seastar::internal::cpu_stall_detector::on_signal at ./build/release/seastar/./seastar/src/core/reactor.cc:1143
|              (inlined by) seastar::reactor::block_notifier at ./build/release/seastar/./seastar/src/core/reactor.cc:1423
++           - addr=0x3cb1f: ?? ??:0
  ++[1#1/669 7%] addr=0x1e23ba5 total=1052 count=173 avg=6:
  |              utils::uleb64_decode_forwards<void (&)(char const*, unsigned long), void (&)(char const*, unsigned long)> at ././utils/vle.hh:105
  |              (inlined by) logalloc::region_impl::object_descriptor::decode_forwards at ./utils/logalloc.cc:1619
  |              (inlined by) logalloc::region_impl::for_each_live<logalloc::region_impl::compact_segment_locked(logalloc::segment*, logalloc::segment_descriptor&)::{lambda(logalloc::region_impl::object_descriptor const*, void*, unsigned long)#1}> at ./utils/logalloc.cc:1717
  |              (inlined by) logalloc::region_impl::compact_segment_locked at ./utils/logalloc.cc:1873
  | ++[2#1/3 64%] addr=0x1e1a56b total=671 count=110 avg=6:
  | |             logalloc::region_impl::compact at ./utils/logalloc.cc:2213
  | |             (inlined by) logalloc::reclaim_from_evictable at ./utils/logalloc.cc:2532
  | | ++[3#1/1 100%] addr=0x1e18824 total=6973 count=1145 avg=6:
  | | |              logalloc::tracker::impl::compact_and_evict_locked at ./utils/logalloc.cc:2691
  | |   ++[4#1/1 100%] addr=0x1e203f0 total=9103 count=1497 avg=6:
  | |   |              logalloc::tracker::impl::compact_and_evict at ./utils/logalloc.cc:2618
  | |   |              (inlined by) logalloc::segment_pool::allocate_segment at ./utils/logalloc.cc:1279
  | |   |              (inlined by) logalloc::segment_pool::refill_emergency_reserve at ./utils/logalloc.cc:1292
  | |   |              (inlined by) logalloc::allocating_section::reserve at ./utils/logalloc.cc:2871
  | |     ++[5#1/8 35%] addr=0x1df131d total=4295 count=705 avg=6:
  | |     |             logalloc::allocating_section::with_reserve<logalloc::allocating_section::operator()<row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&)::$_21>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&)::$_21)::{lambda()#1}::operator()() const::{lambda()#2}::operator()()::{lambda()#3}::operator()() const::{lambda()#1}>(logalloc::region&, row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&)::$_21>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&)::$_21)::{lambda()#1}::operator()() const::{lambda()#2}::operator()()::{lambda()#3}::operator()() const::{lambda()#1}&&)::{lambda()#1}> at ././utils/logalloc.hh:469
  | |     |             (inlined by) logalloc::allocating_section::operator()<row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&)::$_21>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&)::$_21)::{lambda()#1}::operator()() const::{lambda()#2}::operator()()::{lambda()#3}::operator()() const::{lambda()#1}> at ././utils/logalloc.hh:520
  | |     |             (inlined by) operator() at ./row_cache.cc:971
  | |     |             (inlined by) with_allocator<row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&)::$_21>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&)::$_21)::{lambda()#1}::operator()() const::{lambda()#2}::operator()()::{lambda()#3}> at ././utils/allocation_strategy.hh:313
  | |     |             (inlined by) operator() at ./row_cache.cc:961
  | |     |             (inlined by) std::__invoke_impl<void, row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&)::$_21>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&)::$_21)::{lambda()#1}::operator()() const::{lambda()#2}> at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/invoke.h:61
  | |     |             (inlined by) std::__invoke<row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&)::$_21>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&)::$_21)::{lambda()#1}::operator()() const::{lambda()#2}> at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/bits/invoke.h:96
  | |     |             (inlined by) std::__apply_impl<row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&)::$_21>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&)::$_21)::{lambda()#1}::operator()() const::{lambda()#2}, std::tuple<>> at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/tuple:1852
  | |     |             (inlined by) std::apply<row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&)::$_21>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&)::$_21)::{lambda()#1}::operator()() const::{lambda()#2}, std::tuple<> > at /usr/bin/../lib/gcc/x86_64-redhat-linux/12/../../../../include/c++/12/tuple:1863
  | |     |             (inlined by) seastar::futurize<void>::apply<row_cache::do_update<row_cache::update(row_cache::external_updater, replica::memtable&)::$_21>(row_cache::external_updater, replica::memtable&, row_cache::update(row_cache::external_updater, replica::memtable&)::$_21)::{lambda()#1}::operator()() const::{lambda()#2}> at ././seastar/include/seastar/core/future.hh:1963
  | |     |             (inlined by) operator() at ././seastar/include/seastar/core/thread.hh:258
  | |     | ++[6#1/1 100%] addr=0x56b7576 total=4301 count=706 avg=6:
  | |     | |              seastar::noncopyable_function<void ()>::operator() at ./build/release/seastar/./seastar/include/seastar/util/noncopyable_function.hh:210
  | |     | |              (inlined by) seastar::thread_context::main at ./build/release/seastar/./seastar/src/core/thread.cc:299
  | |     |+[5#2/8 30%] addr=0x1ad2213 total=3660 count=601 avg=6:
  | |     |             logalloc::allocating_section::with_reserve<logalloc::allocating_section::operator()<replica::memtable::apply(frozen_mutation const&, seastar::lw_shared_ptr<schema const> const&, db::rp_handle&&)::$_9::operator()() const::{lambda()#1}>(logalloc::region&, replica::memtable::apply(frozen_mutation const&, seastar::lw_shared_ptr<schema const> const&, db::rp_handle&&)::$_9::operator()() const::{lambda()#1}&&)::{lambda()#1}> at ././utils/logalloc.hh:469
  | |     |             (inlined by) logalloc::allocating_section::operator()<replica::memtable::apply(frozen_mutation const&, seastar::lw_shared_ptr<schema const> const&, db::rp_handle&&)::$_9::operator()() const::{lambda()#1}> at ././utils/logalloc.hh:520
  | |     |             (inlined by) operator() at ./replica/memtable.cc:794
  | |     |             (inlined by) with_allocator<replica::memtable::apply(frozen_mutation const&, seastar::lw_shared_ptr<schema const> const&, db::rp_handle&&)::$_9> at ././utils/allocation_strategy.hh:313
  | |     |             (inlined by) replica::memtable::apply at ./replica/memtable.cc:793
  | |     |+          - addr=0x1a0efda:
  | |     |             replica::table::do_apply<frozen_mutation const&, seastar::lw_shared_ptr<schema const>&> at ./replica/table.cc:2267
  | |     |             (inlined by) operator() at ./replica/table.cc:2290
  | |     |             (inlined by) seastar::futurize<void>::invoke<replica::table::apply(frozen_mutation const&, seastar::lw_shared_ptr<schema const>, db::rp_handle&&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >)::$_43&> at ././seastar/include/seastar/core/future.hh:1996
  | |     | ++[6#1/2 96%] addr=0x1a0f4a3 total=3727 count=612 avg=6:
  | |     | |             replica::dirty_memory_manager_logalloc::region_group::concrete_allocating_function<replica::table::apply(frozen_mutation const&, seastar::lw_shared_ptr<schema const>, db::rp_handle&&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >)::$_43>::allocate at ./replica/dirty_memory_manager.hh:118
  | |     | ++          - addr=0x1b2068d:
  | |     | |             replica::dirty_memory_manager_logalloc::region_group::execute_one at ./replica/dirty_memory_manager.cc:106
  | |     | |             (inlined by) replica::dirty_memory_manager_logalloc::region_group::release_queued_allocations at ./replica/dirty_memory_manager.cc:117

And the stalls look similar to #13160
cc @tgrabiec @michoecho

@michoecho
Contributor

The cause of everything else is that something is wrong specifically with shard 0. Somehow it needs more CPU to do the same job as the other shards:

image

@michoecho
Contributor

It doesn't appear to be just an infinite loop on shard 0 or something, because the extra load happens in both the write phase and the read phase of the test, but not in between.

@michoecho
Contributor

I'll check if it happens locally. If not, it's probably a test setup issue. If yes, I'll bisect it.

@michoecho
Contributor

It doesn't happen on my PC with 1da0270.
I'll also try with the AMI.

@michoecho
Contributor

michoecho commented Apr 18, 2023

This is a kernel and/or i3.4xlarge regression (edit: I didn't test other instance types). The NIC doesn't report its IRQ numbers for some reason (/sys/class/net/eth0/device/msi_irqs/ is empty), so perftune can't distribute networking IRQs properly (it doesn't do anything about them), and by default they concentrate on the shard 0 CPU. (Don't ask me why).
A manual echo 00010001 | sudo tee /proc/irq/*/smp_affinity fixes the performance problem.

I don't know what we can do about this.
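
(For reference, a sketch of the workaround plus a quick check that it took effect; eth0 and the 00010001 mask come from the comment above and may need adjusting per instance type:)

# Pin every IRQ to CPUs 0 and 16 (hex bitmask 00010001); writes to IRQs
# that don't allow affinity changes will fail harmlessly.
echo 00010001 | sudo tee /proc/irq/*/smp_affinity
# Verify: the per-CPU counters in /proc/interrupts for the NIC's queues
# should now grow only in the columns for those CPUs.
grep eth0 /proc/interrupts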

@michoecho
Contributor

michoecho commented Apr 18, 2023

my suspicion is that it's something related to compactions; it looks like the compactions are not "steady" during the entire write, and the effect is severe.

@roydahan Compactions being bursty is an effect, not a cause. Since shards other than 0 are very underloaded, they are able to devote most of their CPU to compactions (normally they would devote a small but smoothly growing fraction of their time), and complete them in a fast burst.

@vladzcloudius
Contributor

vladzcloudius commented Apr 18, 2023

This is a kernel and/or i3.4xlarge regression (edit: I didn't test other instance types). The NIC doesn't report its IRQ numbers for some reason (/sys/class/net/eth0/device/msi_irqs/ is empty), so perftune can't distribute networking IRQs properly (it doesn't do anything about them), and by default they concentrate on the shard 0 CPU. (Don't ask me why). A manual echo 00010001 | sudo tee /proc/irq/*/smp_affinity fixes the performance problem.

I don't know what we can do about this.

@aleksbykov @michoecho please, send the following info from the node where you saw the above:

  • ethtool -i eth0
  • uname -a

I strongly doubt there is any kernel issue here.
Much more likely, Enhanced Networking is not enabled for this VM and a fully virtual NIC device is used.

Based on the perftune output, no IRQs were detected, meaning the NIC in question wasn't exposing MSI-X, MSI, or INT#x vectors to the guest OS.
That is only possible with fully virtualized devices, to the best of my knowledge.

@michoecho
Contributor

@aleksbykov @michoecho please, send the following info from the node where you saw the above:

ethtool -i eth0

driver: ena
version: 5.19.0-1022-aws
firmware-version:
expansion-rom-version:
bus-info: 0000:00:03.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no

uname -a

Linux ip-10-0-1-194 5.19.0-1022-aws #23~22.04.1-Ubuntu SMP Fri Mar 17 15:38:24 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

@vladzcloudius
Contributor

Looks legit. How can I get access to that VM?

@michoecho
Contributor

michoecho commented Apr 18, 2023

Looks legit. How can I get access to that VM?

It's not some particular VM. Just launch ami-0501eb17c8c79b6d2 (us-east-1) on i3.4xlarge.

Edit: mine is already shut down, so I can't give you access to it and save you the effort.

@vladzcloudius
Contributor

vladzcloudius commented Apr 18, 2023

my suspicion is that it's something related to compactions; it looks like the compactions are not "steady" during the entire write, and the effect is severe.

@roydahan Compactions being bursty is an effect, not a cause. Since shards other than 0 are very underloaded, they are able to devote most of their CPU to compactions (normally they would devote a small but smoothly growing fraction of their time), and complete them in a fast burst.

This statement is incorrect in general, @michoecho.
Compactions will never use a "small smoothly growing fraction of CPU" - they will always use every CPU or other resource available as long as they run. The same applies to any running context in seastar. Only if some running context is IO-bound can you see some "free" CPU while it runs. However, this is almost never the case with compactions, which are mainly CPU-bound. That's why, when compactions are running on a specific shard, its "load" is always at 100%.

The actual amount of CPU out of those 100% used by compactions depends on the other running contexts and their relative amount of shares compared to compaction's.

@michoecho
Contributor

my suspicion is that it's something related to compactions; it looks like the compactions are not "steady" during the entire write, and the effect is severe.

@roydahan Compactions being bursty is an effect, not a cause. Since shards other than 0 are very underloaded, they are able to devote most of their CPU to compactions (normally they would devote a small but smoothly growing fraction of their time), and complete them in a fast burst.

This statement is incorrect in general, @michoecho. Compactions will never use a "small smoothly growing fraction of CPU" - they will always use every CPU or other resource available as long as they run. The same applies to any running context in seastar. Only if some running context is IO-bound can you see some "free" CPU while it runs. However, this is almost never the case with compactions, which are mainly CPU-bound. That's why, when compactions are running on a specific shard, its "load" is always at 100%.

The actual amount of CPU out of those 100% used by compactions depends on the other running contexts and their relative amount of shares compared to compaction's.

Yes. By "normally" I meant "in a regular run of this particular test", not in general.

@mykaul mykaul changed the title Severed degradation in write throughput Severe degradation in write throughput Apr 19, 2023
@mykaul mykaul added the triage/master Looking for assignee label Apr 19, 2023
@mykaul
Contributor

mykaul commented Apr 19, 2023

Linux ip-10-0-1-194 5.19.0-1022-aws #23~22.04.1-Ubuntu SMP Fri Mar 17 15:38:24 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Didn't we agree to remain on 5.15? (https://www.omgubuntu.co.uk/2022/01/ubuntu-22-04-lts-will-use-linux-5-15-kernel ) - how did you get 5.19?

@aleksbykov
Contributor Author

aleksbykov commented Apr 19, 2023

Latest successful run is with: Scylla version 5.3.0~dev-0.20230325.e8fb718e4ad4 with build-id 6eed28a1ac2addc02aceea60af4d6ee4acd56955 PASSED ami-078e6867d914fbfb0

Linux perf-regression-latency-ubuntu-db-node-363ae036-1 5.15.0-1031-aws #35-Ubuntu SMP Fri Feb 10 02:07:18 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

scyllaadm@perf-regression-latency-ubuntu-db-node-363ae036-1:~$ ls /sys/class/net/eth0/device/msi_irqs/
210  211  212  213  214  215  216  217  218

scyllaadm@perf-regression-latency-ubuntu-db-node-363ae036-1:~$  ethtool -i eth0
driver: ena
version: 5.15.0-1031-aws
firmware-version: 
expansion-rom-version: 
bus-info: 0000:00:03.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no

First failed run is with: Scylla version 5.3.0~dev-0.20230328.c7131a05741d with build-id 6358d7ada913b1dfc96849ddb519b7a243afe0bd FAILED ami-020e718640eafe444

Between them we don't have an AMI.

For Scylla version 5.3.0~dev-0.20230328.c7131a05741d with build-id 6358d7ada913b1dfc96849ddb519b7a243afe0bd
the instance has:
an empty /sys/class/net/eth0/device/msi_irqs/
scyllaadm@perf-regression-latency-ubuntu-db-node-a292366c-1:~$ ethtool -i eth0

driver: ena
version: 5.19.0-1022-aws
firmware-version: 
expansion-rom-version: 
bus-info: 0000:00:03.0
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no

scyllaadm@perf-regression-latency-ubuntu-db-node-a292366c-1:~$ uname -a

Linux perf-regression-latency-ubuntu-db-node-a292366c-1 5.19.0-1022-aws #23~22.04.1-Ubuntu SMP Fri Mar 17 15:38:24 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

@mykaul
Contributor

mykaul commented Apr 19, 2023

@yaronkaikov - the Ubuntu 22.04 LTS kernel should be 5.15, with an optional 5.19 for HW enablement - but we've never moved to 5.19 explicitly - how come the AMI uses it?

@mykaul
Contributor

mykaul commented Apr 19, 2023

@vladzcloudius - do we have any tests for perftune that can catch such issues?

@yaronkaikov
Contributor

@yaronkaikov - the Ubuntu 22.04 LTS kernel should be 5.15, with an optional 5.19 for HW enablement - but we've never moved to 5.19 explicitly - how come the AMI uses it?

Since we are using the latest image available, we don't pin the kernel version.

@michoecho
Contributor

michoecho commented Apr 19, 2023

@mykaul I guess our AMIs use the linux-aws apt package, which is currently 5.19, instead of linux-aws-lts, which is 5.15.
Edit: it's called linux-aws-lts-22.04, not just linux-aws-lts.
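
(A quick way to check which kernel flavour a node has installed - a sketch using the package names mentioned above, not the actual AMI build code:)

# linux-aws tracks the rolling kernel (5.19 at the time of this issue);
# linux-aws-lts-22.04 stays on the 5.15 GA kernel.
dpkg -l | grep -E '^ii +linux-aws'
# Switch an image to the LTS flavour:
sudo apt-get update && sudo apt-get install -y linux-aws-lts-22.04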

@mykaul
Contributor

mykaul commented Apr 19, 2023

@mykaul I guess our AMIs use the linux-aws apt package, which is currently 5.19, instead of linux-aws-lts, which is 5.15.

@yaronkaikov - this might be the issue. We were supposed to keep using LTS.

@michoecho
Contributor

@mykaul I guess our AMIs use the linux-aws apt package, which is currently 5.19, instead of linux-aws-lts, which is 5.15.

@yaronkaikov - this might be the issue. We were supposed to keep using LTS.

I just verified that the current linux-aws-lts-22.04 (5.15.0-1034-aws) is not affected by the problem. It reports IRQs as expected.

@avikivity
Member

Still, we need to report the problem, so it gets fixed before we're forced to move to a newer kernel. Also users may be using that newer kernel.

@DoronArazii DoronArazii added P1 Urgent status/release blocker Preventing from a release to be promoted labels Apr 19, 2023
@vladzcloudius
Contributor

Tested a smaller i4i instance (2xl) - still works:

scyllaadm@ip-10-99-17-182:~$ uname -a
Linux ip-10-99-17-182 5.19.0-1022-aws #23~22.04.1-Ubuntu SMP Fri Mar 17 15:38:24 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
scyllaadm@ip-10-99-17-182:~$ sudo ls -al /sys/devices/pci0000:00/0000:00:05.0/msi_irqs/
total 0
drwxr-xr-x 2 root root    0 Apr 26 14:11 .
drwxr-xr-x 6 root root    0 Apr 26 14:10 ..
-r--r--r-- 1 root root 4096 Apr 26 14:11 36
-r--r--r-- 1 root root 4096 Apr 26 14:11 37
-r--r--r-- 1 root root 4096 Apr 26 14:11 38
-r--r--r-- 1 root root 4096 Apr 26 14:11 39
-r--r--r-- 1 root root 4096 Apr 26 14:11 40
-r--r--r-- 1 root root 4096 Apr 26 14:11 41
-r--r--r-- 1 root root 4096 Apr 26 14:11 42
-r--r--r-- 1 root root 4096 Apr 26 14:11 43
-r--r--r-- 1 root root 4096 Apr 26 14:11 44

@yaronkaikov
Contributor

Closed with scylladb/scylla-machine-image#443

@mykaul
Contributor

mykaul commented Apr 27, 2023

Closed with scylladb/scylla-machine-image#443

It's not closed, in the sense that we do need/want to support that 5.19 kernel.

@mykaul
Contributor

mykaul commented Apr 27, 2023

I posted some analysis on amzn/amzn-drivers#268. In gist, what they wrote makes very little sense. I think they messed something up in this particular kernel assembly.

So perhaps it is an Ubuntu issue (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2016991 ) after all?

@yaronkaikov yaronkaikov removed their assignment Apr 27, 2023
@vladzcloudius
Contributor

vladzcloudius commented Apr 27, 2023

I posted some analysis on amzn/amzn-drivers#268. In gist, what they wrote makes very little sense. I think they messed something up in this particular kernel assembly.

So perhaps it is an Ubuntu issue (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2016991 ) after all?

It's definitely not an Ubuntu issue. If it were, it would not work on i4i or with the older (Amazon!) kernel.

It's quite obviously an Amazon kernel bug. They (Amazon) have to fix it.
We, on the other hand, should not use their kernels with this bug. In this context that means rolling back, in our AMIs, to an older kernel without the issue.

@yaronkaikov @mykaul

@vladzcloudius vladzcloudius reopened this Apr 27, 2023
@vladzcloudius
Contributor

Closed with scylladb/scylla-machine-image#443

@yaronkaikov @mykaul Are you sure it's safe to not use Amazon's kernels that AFAIU come with Amazon's vanilla Ubuntu AMIs? Are you sure Ubuntu's LTS kernels are certified by Amazon?

@mykaul
Contributor

mykaul commented Apr 27, 2023

Closed with scylladb/scylla-machine-image#443

@yaronkaikov @mykaul Are you sure it's safe to not use Amazon's kernels that AFAIU come with Amazon's vanilla Ubuntu AMIs? Are you sure Ubuntu's LTS kernels are certified by Amazon?

I think it's safe. I also slightly prefer having a single kernel version (as much as possible) across cloud providers (you could also ask about serverless - we haven't decided yet - https://github.com/scylladb/serverless-issues/issues/11 )

I could not find a 'certified by Amazon' statement, but I assume that's because Ubuntu Pro might be.

@vladzcloudius
Contributor

vladzcloudius commented Apr 27, 2023

See here:
https://aws.amazon.com/blogs/aws/use-your-own-kernel-with-amazon-ec2/

They confirm that "Ubuntu EC2 Variant" would work reliably.

Also see this: https://ubuntu.com/blog/introducing-the-ubuntu-aws-rolling-kernel-2

Bottom line, vanilla LTS kernels are not safe in EC2 AMIs.

There are many reasons for that: there are a few AWS-specific device drivers and Xen-related bits that are only relevant for EC2 users. And hence (I guess) they are maintained much faster in this kernel stream than in the mainline LTS stream.

@michoecho
Contributor

Closed with scylladb/scylla-machine-image#443

@yaronkaikov @mykaul Are you sure it's safe to not use Amazon's kernels that AFAIU come with Amazon's vanilla Ubuntu AMIs? Are you sure Ubuntu's LTS kernels are certified by Amazon?

Where is your question coming from? Isn't the PR you linked using Amazon's LTS kernel (linux-aws-lts-22.04), not Ubuntu's vanilla LTS kernel?

@vladzcloudius
Contributor

Closed with scylladb/scylla-machine-image#443

@yaronkaikov @mykaul Are you sure it's safe to not use Amazon's kernels that AFAIU come with Amazon's vanilla Ubuntu AMIs? Are you sure Ubuntu's LTS kernels are certified by Amazon?

Where is your question coming from? Isn't the PR you linked using Amazon's LTS kernel (linux-aws-lts-22.04), not Ubuntu's vanilla LTS kernel?

It comes from me missing this fact. ;) I assumed a vanilla LTS was used. Thanks for pointing this out, @michoecho.

@yaronkaikov have we verified that the latest aws-lts kernel doesn't suffer from the same bug?

@michoecho
Contributor

michoecho commented Apr 27, 2023

Closed with scylladb/scylla-machine-image#443

@yaronkaikov @mykaul Are you sure it's safe to not use Amazon's kernels that AFAIU come with Amazon's vanilla Ubuntu AMIs? Are you sure Ubuntu's LTS kernels are certified by Amazon?

Where is your question coming from? Isn't the PR you linked using Amazon's LTS kernel (linux-aws-lts-22.04), not Ubuntu's vanilla LTS kernel?

It comes from me missing this fact. ;) I assumed a vanilla LTS was used. Thanks for pointing this out, @michoecho.

@yaronkaikov have we verified that the latest aws-lts kernel doesn't suffer from the same bug?

I have (#13560 (comment)), but a sanity check from someone else would be appreciated.

@vladzcloudius
Contributor

Closed with scylladb/scylla-machine-image#443

@yaronkaikov @mykaul Are you sure it's safe to not use Amazon's kernels that AFAIU come with Amazon's vanilla Ubuntu AMIs? Are you sure Ubuntu's LTS kernels are certified by Amazon?

Where is your question coming from? Isn't the PR you linked using Amazon's LTS kernel (linux-aws-lts-22.04), not Ubuntu's vanilla LTS kernel?

It comes from me missing this fact. ;) I assumed a vanilla LTS was used. Thanks for pointing this out, @michoecho.
@yaronkaikov have we verified that the latest aws-lts kernel doesn't suffer from the same bug?

I have (#13560 (comment)), but a sanity check from someone else would be appreciated.

First of all - great!
However, this GH issue has 5.19 in its title, while the kernel you've checked is 5.15. So I'm still not sure we want to close this GH issue yet.

@yaronkaikov
Contributor

Closed with scylladb/scylla-machine-image#443

@yaronkaikov @mykaul Are you sure it's safe to not use Amazon's kernels that AFAIU come with Amazon's vanilla Ubuntu AMIs? Are you sure Ubuntu's LTS kernels are certified by Amazon?

Where is your question coming from? Isn't the PR you linked using Amazon's LTS kernel (linux-aws-lts-22.04), not Ubuntu's vanilla LTS kernel?

It comes from me missing this fact. ;) I assumed a vanilla LTS was used. Thanks for pointing this out, @michoecho.

@yaronkaikov have we verified that the latest aws-lts kernel doesn't suffer from the same bug?

I didn't; it's part of rc5, so QA will verify it.

@mykaul
Contributor

mykaul commented May 1, 2023

This specific issue was fixed with 5.2.0-rc5, by ensuring we use kernel 5.15 which doesn't suffer from it (and is truly LTS, btw). There's a separate issue (in seastar, and elsewhere) to track the Linux kernel regression which should be fixed. Therefore, closing as completed.

@mykaul mykaul closed this as completed May 1, 2023
@vladzcloudius
Contributor

This specific issue was fixed with 5.2.0-rc5, by ensuring we use kernel 5.15 which doesn't suffer from it (and is truly LTS, btw). There's a separate issue (in seastar, and elsewhere) to track the Linux kernel regression which should be fixed. Therefore, closing as completed.

@mykaul there is nothing to fix in seastar in this context.
A proper place to track AMI kernel related issues is the AMI-related repo (scylla-machine-image ?)

cc @syuu1228

@mykaul
Contributor

mykaul commented May 1, 2023

This specific issue was fixed with 5.2.0-rc5, by ensuring we use kernel 5.15 which doesn't suffer from it (and is truly LTS, btw). There's a separate issue (in seastar, and elsewhere) to track the Linux kernel regression which should be fixed. Therefore, closing as completed.

@mykaul there is nothing to fix in seastar in this context. A proper place to track AMI kernel related issues is the AMI-related repo (scylla-machine-image ?)

cc @syuu1228

@vladzcloudius
Contributor

vladzcloudius commented May 1, 2023

This specific issue was fixed with 5.2.0-rc5, by ensuring we use kernel 5.15 which doesn't suffer from it (and is truly LTS, btw). There's a separate issue (in seastar, and elsewhere) to track the Linux kernel regression which should be fixed. Therefore, closing as completed.

@mykaul there is nothing to fix in seastar in this context. A proper place to track AMI kernel related issues is the AMI-related repo (scylla-machine-image ?)

* I thought we'd want perftune to handle the case where we get an empty list.

No, we don't. It's a bogus state of the kernel from our perspective.

* The issue itself is already fixed via [(ami,azure):Install latest LTS kernel during image build  scylla-machine-image#443](https://github.com/scylladb/scylla-machine-image/pull/443) and [(gcp):Install latest LTS kernel during image build scylla-machine-image#446](https://github.com/scylladb/scylla-machine-image/pull/446)

I know (hence not re-opening this one). But what you probably want is to have some tracking for the next time we upgrade the kernel in the AMI.

BTW, upgrading to the "latest LTS" is a risk too, as I explained some time ago.
What you want to do is upgrade to a new version in a controlled manner, after you have tested that new kernel in QA - in order to avoid issues like this one.

And to do this you need to pin the kernel.
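
(A sketch of one way to do that pinning with apt, assuming the Ubuntu meta-package names discussed above:)

# Hold the kernel meta-package so image rebuilds and unattended upgrades
# keep the QA-tested version until it is deliberately unpinned.
sudo apt-mark hold linux-aws-lts-22.04
apt-mark showhold   # confirm the hold is in place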

And when you upgrade you must make sure to upgrade SC installations to that kernel too.

Tomer recently performed an investigation of kernel versions in SC, and the picture is nothing but terrible.

We really have to take this under control, and upgrading to the "latest LTS" every time we build a new AMI is not helping us get there...

cc @tomer-sandler @syuu1228


@roydahan roydahan reopened this May 2, 2023
@roydahan

roydahan commented May 2, 2023

I reopened the issue because in GCP we still have kernel 5.19, and we need to verify, for several instance types, whether there is an issue or not.

@vladzcloudius we need your help on how to test it and identify, in the easiest way, where the IRQs are set correctly and where they are not.

@mykaul
Contributor

mykaul commented May 2, 2023

I reopened the issue because in GCP we still have kernel 5.19, and we need to verify, for several instance types, whether there is an issue or not.

@vladzcloudius we need your help on how to test it and identify, in the easiest way, where the IRQs are set correctly and where they are not.

The issue does not exist on GCP simply because the ENA driver is AWS-specific.
To see if the issue is there or not, you can run the performance tests and see if core 0 and such are way too overloaded - just as you've seen in the original issue. I suggest closing this one.

@michoecho
Contributor

To see if the issue is there or not, you can run the performance tests and see if core 0 and such are way too overloaded - just as you've seen in the original issue

The overload of shard 0 was an accidental effect, not a direct and predictable result of the bug. So checking for the bug via black box performance tests is a bad idea. (But of course if the issue is AWS-specific then there is no reason to do anything about it for GCP).

@mykaul
Contributor

mykaul commented May 2, 2023

To see if the issue is there or not, you can run the performance tests and see if core 0 and such are way too overloaded - just as you've seen in the original issue

The overload of shard 0 was an accidental effect, not a direct and predictable result of the bug. So checking for the bug via black box performance tests is a bad idea. (But of course if the issue is AWS-specific then there is no reason to do anything about it for GCP).

Of course - we can look at the interrupt mapping, which was supposed to happen but did not.
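
(For reference, a sketch of such a direct check, assuming the NIC is eth0; each column in /proc/interrupts is a per-CPU counter, so a single hot column means the IRQs were not spread:)

# Per-CPU delivery counts for the NIC's IRQs; with perftune working they
# should grow only on the CPUs reserved for networking.
grep eth0 /proc/interrupts
# The affinity masks perftune should have written:
for irq in $(ls /sys/class/net/eth0/device/msi_irqs/ 2>/dev/null); do
    printf 'IRQ %s -> %s\n' "$irq" "$(cat /proc/irq/$irq/smp_affinity)"
done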

@vladzcloudius
Contributor

vladzcloudius commented May 2, 2023

I reopened the issue because in GCP we still have kernel 5.19, and we need to verify, for several instance types, whether there is an issue or not.

@vladzcloudius we need your help on how to test it and identify, in the easiest way, where the IRQs are set correctly and where they are not.

This particular issue (a kernel bug) shows up as empty content of the
/sys/class/net/<NIC name>/device/msi_irqs/ directory.

When the kernel is working correctly, both in AWS and in GCP, this directory has to contain files named after the IRQs of the corresponding NIC, like I showed here: #13560 (comment)

I believe all you need to verify in the context of this GH issue is that the above holds.
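
(A minimal check along these lines, assuming the NIC is eth0; adjust the name as needed:)

# The bug manifests as an empty msi_irqs directory under the NIC's PCI device.
NIC=eth0
if [ -z "$(ls -A /sys/class/net/$NIC/device/msi_irqs/ 2>/dev/null)" ]; then
    echo "BUG: $NIC exposes no MSI IRQs - perftune cannot distribute them"
else
    ls /sys/class/net/$NIC/device/msi_irqs/
fi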

As for other perftune.py testing - let's have a separate context for that (a GH issue?) and I'll be happy to help.

@roydahan

@fruch
Contributor

fruch commented May 3, 2023

@vponomaryov verified that the IRQs are showing up on GCP n1-highmem-8, n1-highmem-16 and n1-highmem-32:
https://github.com/scylladb/qa-tasks/issues/1145#issuecomment-1533079418

@roydahan roydahan closed this as completed May 4, 2023