disk_irq tuner is broken on i3 instances after kernel upgrade #10838

Closed
travisdowns opened this issue May 17, 2023 · 0 comments · Fixed by #10864
Labels
kind/bug Something isn't working

Comments

@travisdowns

Version & Environment

Redpanda version: 22.3.1 (and others)

What went wrong?

disk_irq tuner may fail with a message like so:

disk_irq               false    true     true       err=signal: segmentation fault (core dumped), stderr=

What should have happened instead?

The problem, as pointed out by @r-vasquez, is that the msi_irqs file is empty for this PCI-E device even though the device uses MSI IRQs:

18:59:03.759  DEBUG  Device '/sys/devices/pci0000:00/0000:00:1e.0' uses MSI IRQs
18:59:03.759  DEBUG  DeviceInfo '/sys/devices/pci0000:00/0000:00:1e.0' IRQs '[]'

This causes us to pass 0 to hwloc-distrib:

calling 'hwloc-distrib-redpanda' with arguments '[0 --single --restrict 0x000000ff]'

which crashes with an assert failure.

How to reproduce the issue?

As above.

Additional information

This seems to be a bug introduced somewhere in kernel version 5.16 to 5.19 as described here:

amzn/amzn-drivers#268

It is patched upstream and will presumably show up in some later kernel:

amzn/amzn-drivers#268 (comment)

In the meantime we could just fail the irq_tuner when we detect this with a better error message?

travisdowns added the kind/bug label May 17, 2023
r-vasquez added a commit to r-vasquez/redpanda that referenced this issue May 18, 2023
Due to a known bug introduced in kernel 5.17,
some instances that use MSI IRQs can have an empty
list of IRQs in sysfs, which makes the tuner fail
and hwloc segfault (impossible to tune).

This change prints a warning instead of an
error and links to the issue that explains
the problem and the upstream reports.

Fixes redpanda-data#10838