Performance regression on I3en with Linux 5.8 #7036
Comments
I don't see ami-04983b862895e0534 in Ireland. Which region is it in? |
Also, what instance type was used? |
Couldn't find the exact AMI ID, but a similar one:
has Linux 5.8.1:
This is consistent with other reports with problems in 5.8 kernels. |
Sorry, forgot to mention that it is in the eu-north-1 region. |
Instance type was : i3en.3xlarge |
@aleksbykov do you have a monitor? |
@roydahan the instances were removed, but I think I can restore them |
This is consistent with the other reports that mentioned i3en+5.8 |
No need to restore it. |
Linux kernel version from the reported AMI ID: |
|
…ws i3en ref scylladb/scylladb#7036 Pin down kernel version used for aws image till the performance regression is fixed Signed-off-by: Shlomi Livne <shlomi@scylladb.com>
Please note this bug doesn't reproduce every time. For us it's 1 out of 4 runs, so you may need to retry provisioning a scylla node until you hit the bad case.
Simple reproducer
On the loader node run the following command
If the performance is good you should see
The second column is operations per second. 200K is OK (the loader is capped out).
In a bad case you will get something like
So in this run the operations per second are below 2K.
In the bad case, running top on the scylla node shows
|
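Because the bad case only shows up in about 1 out of 4 runs, a single good run says little about a kernel. Here is a rough sketch of how many provisioning attempts it takes before a clean streak becomes convincing. It assumes runs are independent and that the 1-in-4 rate holds, neither of which the thread confirms; the function names are made up for illustration.

```python
import math


def miss_probability(p_bad: float, runs: int) -> float:
    """Chance that `runs` independent runs on a bad kernel all look good."""
    return (1.0 - p_bad) ** runs


def runs_needed(p_bad: float, confidence: float) -> int:
    """Smallest number of runs so that a bad kernel is caught at least
    once with probability >= confidence (assumes independent runs)."""
    if not 0.0 < p_bad < 1.0:
        raise ValueError("p_bad must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_bad))


if __name__ == "__main__":
    p = 0.25  # "1 out of 4 runs", taken from the comment above
    print("chance of 4 good runs in a row on a bad kernel:", miss_probability(p, 4))
    print("runs needed for 95% confidence:", runs_needed(p, 0.95))
```

With these assumptions, four clean runs in a row still happen by chance roughly a third of the time, so "it looked good once" is weak evidence that a kernel is fine.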
…ws i3en ref scylladb/scylladb#7036 Pin down kernel version used for aws image till the performance regression is fixed Signed-off-by: Shlomi Livne <shlomi@scylladb.com> (cherry picked from commit 7084433)
Based on what @amoskong reported on the enterprise issue, in order to reproduce we need a cluster with at least one node that has the “bad” kernel.
5.8.1-1.el7.elrepo.x86_64: Good
According to @amoskong, he couldn’t find which commit in 5.8.1 fixes the issue, but the issue is fixed. |
There is nothing relevant in 5.8.1, please try again. |
I did reproduce the issue with 5.8.1. Before retesting, I cleaned the cluster's data. Reproduction scenario:
|
I didn't reproduce the problem with kernel 5.7.0. There are 17595 commits from 5.7 to 5.8.
|
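For a sense of scale, 17595 commits is not as daunting as it sounds when bisecting. A back-of-the-envelope sketch, assuming an idealized linear history (the real kernel history has merges, so the step count `git bisect` reports may differ); the helper name is hypothetical:

```python
def bisect_steps(n_commits: int) -> int:
    """Worst-case number of kernels to build and test when binary
    searching n_commits candidates for the first bad one.

    Equals ceil(log2(n_commits)), computed with integers only.
    """
    if n_commits < 1:
        raise ValueError("need at least one candidate commit")
    return (n_commits - 1).bit_length()


if __name__ == "__main__":
    # Commit count between 5.7 and 5.8, as quoted in the comment above.
    print(bisect_steps(17595))
```

Each step still has to be repeated a few times, because the problem does not reproduce on every run.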
I only managed to test 5.8-rc4 successfully, and the issue can be reproduced there. The built rpm can't be installed on centos7, and neither rpm nor yum prints an error message, so I have to install the kernel and modules with make install. It's easy to leave the instance dead after a restart. I saw an efi/libstub error when building 5.8-rc1/5.8-rc2/5.8-rc3, so I disabled CONFIG_EFI_STUB (CONFIG_EFI_STUB=n) to work around the build error, but then the instance can't boot up after restart.
|
This commit should fix it: 5435f73d5c4a1b7504356876e69ba52de83f4975 |
Most likely, the problem was introduced in -rc1 (where most of the patches are). |
I think we should bisect it, it's the only sure way of arriving at the root cause. |
I have reproduced the issue on 5.8.0 and 5.8.1. Sometimes it requires several (5-8) restarts of the database in order to reproduce the issue. Usually the DB works fine after a machine reboot. My bisection is pointing in the direction of the commit: 633260fa143bbed05e65dc557a492667dfdc45bb |
Wow, I would have assumed that patch would either work perfectly or crash very early after boot, since it's touching such a sensitive area. Any idea how it's broken? Thanks for the update. |
I didn't reproduce the problem with above commit. 633260fa143bbed05e65dc557a492667dfdc45b~1 + [PATCH] efi/x86: Fix build with gcc 4
633260fa143bbed05e65dc557a492667dfdc45b + [PATCH] efi/x86: Fix build with gcc 4
What's your test scenario? |
Have you restarted the db several times?
I'm using the AMI provided running and on a second node: Let me get the build info for you. |
The bisection there is a bit messy since multiple commits got introduced with a branch merge and several ended up in kernel panics. Currently I'm trying to run 633260fa143bbed05e65dc557a492667dfdc45b~1 and see if that is good |
Hi Robert, @avikivity, in my latest test I verified that 633260fa143bbed05e65dc557a492667dfdc45bb introduced the problem.
Yes. But there are other factors that affect the result. I just found that if the cassandra-stress workload is executed too early, before the second node is UP, the issue can't be reproduced. The upstream tree I used to bisect: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Latest result:
633260fa143bbed05e65dc557a492667dfdc45b~1 + [PATCH] efi/x86: Fix build with gcc 4
633260fa143bbed05e65dc557a492667dfdc45b + [PATCH] efi/x86: Fix build with gcc 4
|
Thanks for verifying @amoskong! I'll look into that commit and see what's going on. |
@RobertKettler it's worthwhile to report to lkml, please copy @amoskong and me if you do. |
Pretty amazing failure. |
Alexander Graf mentioned this issue in LKML thread
|
Queued for 5.9.0, 5.8.6 |
5.8.6 was released. @slivne please undo the kernel pinning. |
@penberg - reassigning this to you guys - to remove the changes. |
@penberg ping |
@avikivity We never pinned the kernel for open source releases. I queued reverts for enterprise, though. |
Yes we did, example: scylladb/scylla-machine-image@e22898a |
@avikivity Thanks, not sure why I missed them. Reverted the pinning from 4.0, 4.1, and 4.2 machine-image. |
@penberg what about 4.3? and master? |
@avikivity I can't find that pinning in |
(and therefore, obviously not in |
Is this issue resolved? Is it safe to use ElRepo's ML 5.9.10 kernels? |
There's also a centos-plus kernel that is 4.18 but seems to have far more backports than the stock kernel: |
This issue is closed. For reference, the issue affected 5.8.X and was fixed in 5.8.6. The patch was https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=e027fffff799cdd70400c5485b1a54f482255985 We reverted the pinning of the kernel (we did not add a restriction against using the bad kernels) |
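Since no restriction against the bad kernels was added, anyone who wants a guard has to check the version themselves. A minimal sketch of such a check follows, using a `uname -r` style release string. This is not part of Scylla or its machine image; the function names are hypothetical. It assumes every 5.8 release before 5.8.6 is affected and treats 5.8 release candidates as 5.8.0, which is only partly backed by the thread (5.8-rc4, 5.8.0 and 5.8.1 were reported bad).

```python
import re

# First 5.8.x stable release carrying the fix, per the thread.
FIRST_FIXED_58 = (5, 8, 6)


def parse_kernel_release(release: str) -> tuple:
    """Extract (major, minor, patch) from a release string such as
    '5.8.1-1.el7.elrepo.x86_64'. A missing patch level counts as 0."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def is_affected(release: str) -> bool:
    """True if the release falls in the assumed bad range [5.8.0, 5.8.6)."""
    return (5, 8, 0) <= parse_kernel_release(release) < FIRST_FIXED_58


if __name__ == "__main__":
    for rel in ("5.7.0", "5.8.1-1.el7.elrepo.x86_64", "5.8.6", "5.9.10"):
        print(rel, "affected" if is_affected(rel) else "not affected")
```

The check only covers this particular regression; it says nothing about other problems a given kernel may have.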
Installation details
Scylla version (or git commit hash): version 4.2.rc2-0.20200811.b052f3f5cea
Cluster size: 3
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-04983b862895e0534
Job longevity-8h-large-num-columns failed.
gemini command
The command generated a table with 9 partition keys, 16 clustering keys and 80 columns.
On node3, scylla was stopped and several tables were removed:
After that, scylla was started again. The start took about 20 minutes.
During scylla initialization, a lot of reactor stalls were detected, with delays of up to 14400ms, and messages:
Decoded backtrace:
Decoded backtrace:
Decoded backtrace:
Decoded backtrace:
Finally scylla was initialized:
Gemini generated the next schema: