hwloc-induced crash on ChromeOS [Scylla terminated by SIGFPE on startup on ChromeOS] #10439
Comments
Please share the output of |
Ah, you already noticed it's hwloc-related. Anyway, run it from both host and container to see what stands out.
Hi @avikivity, Running from both the host and the container results in Dmesg
My colleague was able to repro with hwloc-ps as well. It looks like the issue might be in sched_setaffinity, as the following strace seems to indicate:
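For anyone who wants to poke at the affinity call independently of Scylla and hwloc, here is a minimal sketch. It assumes Python 3 on Linux, and `affinity_round_trip` is a hypothetical helper name, not something from hwloc or Scylla. It reads the current CPU mask and tries to set it back unchanged, which should succeed on a healthy system.

```python
import os


def affinity_round_trip(pid: int = 0):
    """Read the CPU affinity mask of `pid` and try to set it back unchanged.

    Returns None where the affinity API is unavailable (non-Linux), otherwise
    a tuple (ok, cpus, errno), with errno set only when the set call failed.
    """
    if not hasattr(os, "sched_getaffinity"):
        return None
    cpus = os.sched_getaffinity(pid)
    try:
        os.sched_setaffinity(pid, cpus)
    except OSError as exc:
        # A failure here points at the kernel/VM rather than at hwloc.
        return (False, sorted(cpus), exc.errno)
    return (True, sorted(cpus), None)


if __name__ == "__main__":
    print(affinity_round_trip())
```

Running it on both the host and inside the container makes it easier to see whether the two environments report different CPU sets.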
Ah well here's something interesting. Reading through open-mpi/hwloc#525, I decided to try their debug steps. HOST
root@21cbe3bdc341:/# lstopo-no-graphics --version
I think we've found a potential root cause of the issue. hwloc has released a workaround as of 2.7.1: open-mpi/hwloc@33b555b. I'm going to test using the newest release of hwloc.
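A quick way to tell whether a given environment already has the workaround is to compare the reported hwloc version against 2.7.1. The sketch below assumes the version string contains a dotted number such as `2.4.1`; the exact output format of `lstopo-no-graphics --version` is an assumption here, and `parse_version` and `has_workaround` are hypothetical helper names.

```python
import re

# First hwloc release that carries the workaround, per the comment above.
FIXED_IN = (2, 7, 1)


def parse_version(text: str):
    """Extract the first dotted version (major.minor[.patch]) from `text`."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        return None
    # A missing patch component is treated as 0.
    return tuple(int(part) if part else 0 for part in match.groups())


def has_workaround(version_text: str) -> bool:
    """True if the parsed version is at least FIXED_IN."""
    version = parse_version(version_text)
    return version is not None and version >= FIXED_IN


if __name__ == "__main__":
    # Example strings are illustrative, not captured output.
    for sample in ("lstopo-no-graphics 2.4.1", "lstopo-no-graphics 2.7.1"):
        print(sample, "->", has_workaround(sample))
```

A distro package may backport the fix without bumping the version, so a `False` here only means the upstream release predates the workaround.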
@avikivity,
I'm not sure whether you would consider upgrading the version of hwloc you have. If not, I plan to create a custom container for the moment, until ChromeOS fixes their bug.
We source our hwloc from Fedora. So two things need to happen: Fedora upgrades their hwloc (you can help it out by filing a Fedora bug) and we update our baseline Fedora version (currently blocked on compiler problems, but we are working on fixing them). So it's good you have a workaround for the near term.
Currently there's a Chrome bug, https://bugs.chromium.org/p/chromium/issues/detail?id=1304418, that causes the Crostini VM to report invalid hardware information. This bug causes Scylla to crash on startup with a SIGFPE (scylladb/scylladb#10439). Since hwloc has already issued a patch to deal with this, we're replacing the dynamically linked lib scylladb uses with the patched version.
I've confirmed that the workaround works.
This is Scylla's bug tracker, to be used for reporting bugs only.
If you have a question about Scylla, and not a bug, please ask it in
our mailing-list at scylladb-dev@googlegroups.com or in our slack channel.
Installation details
Scylla version (or git commit hash): Tested on 4.5.0, 4.5.4, 4.6.0, 5.0.rc3
Cluster size: 1
OS (RHEL/CentOS/Ubuntu/AWS AMI): Debian Bullseye on ChromeOS
Hardware details (for performance issues) Delete if unneeded
Platform (physical/VM/cloud instance type/docker):
Hardware: sockets= cores= hyperthreading= memory=
Disks: (SSD/HDD, count)
Right now we're running Scylla in a Docker container, which is running in an LXC container, which in turn is running on a very light VM.
Summary
When attempting to run dockerized Scylla on ChromeOS with Debian Bullseye, Scylla fails with a SIGFPE during startup. This appears to be because Scylla uses hwloc to query hardware information and then pins threads to the resources it finds. That seems to fail when hwloc unexpectedly returns 0, so we end up with a division by zero.
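To illustrate the suspected failure mode, here is a small sketch of how a detected count of 0 turns into a division by zero, and how a guard avoids it. This is not Scylla's or Seastar's actual code. `split_evenly` and `guarded_units` are hypothetical names, and Python raises `ZeroDivisionError` where a native integer division would raise SIGFPE.

```python
def split_evenly(total: int, units: int) -> int:
    """Divide `total` resources across `units`; fails if `units` is 0."""
    return total // units


def guarded_units(detected: int, fallback: int = 1) -> int:
    """Replace a non-positive detected count with a sane fallback."""
    return detected if detected > 0 else fallback


if __name__ == "__main__":
    detected = 0  # what a broken topology report might yield
    try:
        split_evenly(8, detected)
    except ZeroDivisionError:
        print("unguarded: division by zero")
    print("guarded:", split_evenly(8, guarded_units(detected)))
```

The guard only hides the symptom; the underlying problem is still the invalid hardware information being reported.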
Repro Steps
docker run -it scylladb/scylla
In the logs you'll see the following:
Further debug info
From looking at dmesg we get the following
Via Strace