Coredump during read-only workload #3830
Seems like an issue in
I'm looking at it.
@glommer are scans timing out? Is the
One possible explanation: when stopping a shard reader fails, it is left in
Regardless of whether this is the underlying cause, I'll send a patch for this, as it may well cause an assert failure like this one.
Currently, when stopping a reader fails, no attempt is made to save it, and it is left in the `_readers` array as-is. This can lead to an assertion failure, as the reader state will contain futures that were already waited upon and that the cleanup code will attempt to wait on again. To prevent this, when stopping a reader fails, reset it to the nonexistent state, so that the cleanup code doesn't attempt to do anything with it.

Refs: #3830
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <a1afc1d3d74f196b772e6c218999c57c15ca05be.1539088164.git.bdenes@scylladb.com>
After a discussion with @glommer it seems that this is indeed the case. I did not expect to see values this high for these counters, but I guess we will see numbers like this on severely overloaded nodes. Lesson learned: badness counters are good! :)
I sent a patch that should solve this; once it's in, we can close this issue.
Fix was committed as d467b51. This can be closed now.
Currently, when stopping a reader fails, no attempt is made to save it, and it is left in the `_readers` array as-is. This can lead to an assertion failure, as the reader state will contain futures that were already waited upon and that the cleanup code will attempt to wait on again. To prevent this, when stopping a reader fails, reset it to the nonexistent state, so that the cleanup code doesn't attempt to do anything with it.

Refs: #3830
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <a1afc1d3d74f196b772e6c218999c57c15ca05be.1539088164.git.bdenes@scylladb.com>
(cherry picked from commit d467b51)
I am running the 3.0 branch with patches from avi on top (per-user SLA).
However, this coredump doesn't seem to have anything to do with per-user SLA; it appears to be happening in 3.0 code.
The coredump happens while I am running a cassandra-stress command (from the 3.0 scylla-tools branch, which supports fixed mode; so far it doesn't seem to reproduce without it).
The command is:
In parallel to that, I am doing a full table scan with high parallelism. The parallelism is high enough that some of the cassandra-stress queries time out.
After a couple of them time out, Scylla crashes.
I have the coredump and access to the box if anyone wants to take a look.