oversized allocation during manager repair #6297
The error happened again, almost identically, in alternator 3h of the same build. The difference, however, is that node#6 did not experience the issue even though it was alive at the time.
The backtraces:
node#1:
Translated:
node#2:
Translated:
The rest of the nodes, again, suffered the error with very similar backtraces:
node#5:
node#7:
The error caused the connection with node#1 to break:
node list:
logs:
It seems the big allocation comes from a RESTful API request.
@asias So is it more of a manager issue?
@amnonh any idea?
My guess is: To solve that (and potential stalls that can follow) the code should be replaced with a streaming reply and using futures.
The reply of the get token range API can become big, which can cause large allocations and stalls. This patch replaces the implementation so that it streams the results using the HTTP stream capabilities instead of serializing them and sending one big buffer.
Fixes scylladb#6297
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
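To illustrate the difference the patch describes, here is a minimal sketch in Python (not the actual Scylla/Seastar C++ code). It contrasts building the whole reply as one buffer with emitting it element by element. The function names, the `fake_token_ranges` data source, and the field names are all hypothetical and chosen only for illustration.

```python
import json
from typing import Iterable, Iterator


def serialize_all(ranges: Iterable[dict]) -> str:
    """Non-streaming: materializes every element, then builds one big string."""
    return json.dumps(list(ranges))


def stream_json_array(ranges: Iterable[dict]) -> Iterator[str]:
    """Streaming: yields the JSON array as small chunks, one element at a time,
    so the largest single allocation is bounded by the largest element."""
    yield "["
    first = True
    for r in ranges:
        if not first:
            yield ","
        first = False
        yield json.dumps(r)
    yield "]"


def fake_token_ranges(n: int) -> Iterator[dict]:
    """Hypothetical data source; the field names are illustrative assumptions."""
    for i in range(n):
        yield {
            "start_token": str(i),
            "end_token": str(i + 1),
            "endpoints": ["10.0.0.1"],
        }
```

A caller would write each yielded chunk to the connection as it is produced, for example `for chunk in stream_json_array(fake_token_ranges(1000)): out.write(chunk)`, where `out` is whatever writable stream the server provides. The concatenated chunks parse to the same JSON document as the single-buffer version.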
"
This series changes the describe_ring API to use an HTTP stream instead of serializing the results and sending them as a single buffer.
While testing the change I hit a 4-year-old issue inside service/storage_proxy.cc that causes a use after free, so I fixed it along the way.
Fixes #6297
"
* amnonh-stream_describe_ring:
  api/storage_service.cc: stream result of token_range
  storage_service: get_range_to_address_map prevent use after free
I'd like to backport, but want a confirmation that a run with scylla-manager repair was performed on 7c4562d or later.
@avikivity
Logs:
The reply of the get token range API can become big, which can cause large allocations and stalls. This patch replaces the implementation so that it streams the results using the HTTP stream capabilities instead of serializing them and sending one big buffer.
Fixes #6297
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 7c4562d)
Backported to 3.3, 4.0, 4.1.
Installation details
Scylla version (or git commit hash): 666.development-0.20200423.fbcf741c2 with build-id cdcc3451a8e1a80d81a647038c0c28c823345d8d
Cluster size: 6
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-0681759688f9cbe67 (Ireland)
In the 4-hour 100 GB longevity run, during a manager repair, all of the live nodes experienced a seastar_memory oversized allocation issue:
node#7:
Translated backtrace:
Node#1:
Translated backtrace:
The rest of the nodes seem to have experienced the issue with a very similar, if not identical, backtrace:
node#2
node#3:
node#8:
node#9:
This error caused the connection with two of the nodes, node#1 and node#7, to break due to a broken pipe:
node list:
The job's logs: