ReadTimeoutException error on client during performing multiple Add/Drop columns commands during long period of time #15312
Comments
@xemul - can you please take a look? |
io-properties:
So the disk can do ~{700Mb/s, 100kiops} reads and ~{500Mb/s, 60kiops} writes. Commitlog consumes 200Mb/s / 2kiops (writes). That's on the edge of the disk's abilities; it just cannot (well, is not allowed to by the scheduler) do more. |
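As a rough check of the headroom, the figures quoted above can be turned into shares of the disk's write budget. This is only a back-of-the-envelope sketch: the numbers are the ones from this comment, units are taken as written, and the helper name `share` is made up for illustration.

```python
def share(used: float, capacity: float) -> float:
    """Return used/capacity as a percentage."""
    return 100.0 * used / capacity


# Write-side figures quoted in the comment above.
disk_write_bw, disk_write_kiops = 500.0, 60.0   # ~500 Mb/s, ~60 kiops
commitlog_bw, commitlog_kiops = 200.0, 2.0      # 200 Mb/s, 2 kiops

bw_share = share(commitlog_bw, disk_write_bw)
iops_share = share(commitlog_kiops, disk_write_kiops)

print(f"commitlog share of write bandwidth: {bw_share:.0f}%")
print(f"commitlog share of write IOPS: {iops_share:.1f}%")
```

By these figures the commitlog alone takes about 40% of the write bandwidth but only a few percent of the write IOPS, so bandwidth is the tighter of the two budgets; everything else that writes on the node has to fit in the remainder.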
scylladb/seastar#1641 still applies, though, and it does make things worse |
And the last worrisome thing is the IO starvation times. They are pretty big; with such huge queue length values they should have been much smaller. But it can be due to scylladb/seastar#1641 |
@temichus please re-try with the latest master, the problem should be addressed with Kamil's fix for schema versioning. |
@temichus also please start Scylla with |
The issue was reproduced (5 runs, 4 failures).
Installation details
Kernel Version: 5.15.0-1045-aws
Cluster size: 3 nodes (i4i.xlarge)
Scylla Nodes used in this run:
OS / Image:
Test:
Logs and commands
Logs:
|
I'm looking at the logs. One thing I noticed is that you don't provide original Scylla node logs with millisecond resolution timestamps like this:
the milliseconds would be useful for my investigation here TBH... but maybe I'll find something out without it |
Why don't the metrics have read timeout data? |
Maybe because we no longer send zero metrics? |
Also the stressor app doesn't provide timestamps when the timeouts happened. Very convenient. |
Lots of messages like these around this time:
looks like these reads are happening during schema merging. cc @tgrabiec |
That's only on the coordinator node though, and the read requires a quorum. Perhaps the two other nodes are waiting on barriers. Indeed I found some very long running barriers on 10.4.3.118 during this time:
That's 15 seconds to do a read barrier! I didn't find such long barrier on the third node (10.4.2.192), so theoretically we should have a QUORUM... But who knows, the situation is complex when we consider speculative reads. Maybe the metric used to perform the speculative read was already high, we picked |
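The "theoretically we should have a QUORUM" reasoning can be sketched as follows. This is a simplified model, not how the coordinator actually picks replicas: it assumes all replicas are contacted at once and ignores speculative reads entirely. The latencies are hypothetical, with one node stuck behind a 15-second barrier.

```python
def quorum_size(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def quorum_latency(replica_latencies: list[float]) -> float:
    """Time until a quorum has answered, if every replica is asked at once."""
    k = quorum_size(len(replica_latencies))
    return sorted(replica_latencies)[k - 1]


# Hypothetical response times in seconds for the three nodes.
latencies = [0.02, 15.0, 0.03]
print(quorum_size(len(latencies)))   # replicas needed out of 3
print(quorum_latency(latencies))     # decided by the second-fastest replica
```

In this model one slow node does not hurt, because the two fast nodes already form a quorum. The picture changes if the slow node is among the replicas contacted first and the extra request to the third node is sent late or not at all, which is the complication with speculative reads mentioned above.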
A read barrier latency metric could be useful perhaps... |
Do we want to print in the log how long it took to sync the schema? (instead / in addition to a metric? I think a log entry is more useful?) |
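To illustrate the log-entry option, here is a minimal sketch of timing an operation and logging how long it took, escalating to a warning above a threshold. It is written in Python purely as an illustration and is not Scylla code; the logger name, the `timed` helper and the one-second threshold are all assumptions.

```python
import logging
import time
from contextlib import contextmanager

log = logging.getLogger("schema_sync_timing")  # hypothetical logger name


@contextmanager
def timed(operation: str, warn_after_s: float = 1.0):
    """Log the duration of the wrapped block; warn if it was slow."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        level = logging.WARNING if elapsed >= warn_after_s else logging.DEBUG
        log.log(level, "%s took %.3f s", operation, elapsed)


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    with timed("schema sync"):
        time.sleep(0.01)  # stand-in for the real work
```

A log entry like this carries the timestamp and node of each slow sync, which a latency metric aggregated over a scrape interval would not; the two are complementary rather than alternatives.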
Try looking at |
I'm trying to build the metric so it shows it for every node, but apparently I cannot into grafana |
That is sort of expected for such a workload. We could improve it by expiring tombstones sooner with raft, by using tombstone gc mode = repair, and treating state machine catch up as repair. |
Never heard of such databases. Everyone has their tricks to overcome this 'when and how frequently I sync'. |
Are you saying that no databases give actual durability guarantees? |
I know we have commitlog metrics of course. But now we have two commitlogs, so my question is can we see metrics for each one separately. One for the statement may starve, but one for the raft/schema/local tables can be fine. |
Example - https://docs.portworx.com/portworx-enterprise/concepts/io-profiles#the-db_remote-profile |
I restored monitor stack again, but it looks like there is no data? |
There is data for 23/09/06 |
Damn it, I restored monitor from the original post. |
We need those raft=trace logs |
According to @xemul, it could still be that we're waiting for the scheduler (scylladb/seastar#1641), but I don't know how to determine if that's the case |
I opened #15566 |
Here it is: a run with append_scylla_args: --logger-log-level migration_manager=trace:group0_raft_sm=trace:raft=trace
Installation details
Kernel Version: 5.15.0-1045-aws
Cluster size: 3 nodes (i4i.xlarge)
Scylla Nodes used in this run:
OS / Image:
Test:
Logs and commands
Logs:
|
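The value passed to `--logger-log-level` in the run above is a colon-separated list of `name=level` pairs. When such strings are assembled in test configs, a small check can catch a malformed entry before the cluster starts. The helper below is a hypothetical sketch for that purpose and is not part of Scylla or the test framework; it only validates the shape seen in this thread, not the logger names or levels themselves.

```python
def parse_logger_levels(arg: str) -> dict[str, str]:
    """Split 'a=trace:b=debug' into {'a': 'trace', 'b': 'debug'}."""
    levels: dict[str, str] = {}
    for item in arg.split(":"):
        name, sep, level = item.partition("=")
        if not sep or not name or not level:
            raise ValueError(f"malformed logger spec: {item!r}")
        levels[name] = level
    return levels


# The exact value used in the run above.
spec = "migration_manager=trace:group0_raft_sm=trace:raft=trace"
print(parse_logger_levels(spec))
```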
I'm running the test with modified Scylla with additional/modified logging: |
* seastar bab1625c...17183ed4 (73): > thread_pool: Reference reactor, not point to > sstring: inherit publicly from string_view formatter > circleci: use conditional steps > weak_ptr: include used header > build: disable the -Wunused-* warnings for checkheaders > resource: move variable into smaller lexical scope > resource: use structured binding when appropriate > httpd: Added server and client addresses to request structure > io_queue: do not dereference moved-away shared pointer > treewide: explicitly define ctor and assignment operator > memory: use `err` for the error string > doc: Add document describing all the math behind IO scheduler > io_queue: Add flow-rate based self slowdown backlink > io_queue: Make main throttler uncapped > io_queue: Add queue-wide metrics > io_queue: Introduce "flow monitor" > io_queue: Count total number of dispatched and completed requests so far > io_queue: Introduce io_group::io_latency_goal() > tests: test the vector overload for when_all_succeed > core: add a vector overload to when_all_succeed > loop: Fix iterator_range_estimate_vector_capacity for random iters > loop: Add test for iterator_range_estimate_vector_capacity > core/posix return old behaviour using non-portable pthread_attr_setaffinity_np when present > memory: s/throw()/noexcept/ > build: enable -Wdeprecated compiler option > reactor: mark kernel_completion's dtor protected > tests: always wait for promise > http, json, net: define-generated copy ctor for polymorphic types > treewide: do not define constexpr static out-of-line > reactor: do not define dtor of kernel_completion > http/exception: stop using dynamic exception specification > metrics: replace vector with deque > metrics: change metadata vector to deque > utils/backtrace.hh: make simple_backtrace formattable > reactor: Unfriend disk_config_params > reactor: Move add_to_flush_poller() to internal namespace > reactor: Unfriend a bunch of sched group template calls > rpc_test: Test rpc send glitches > 
net: Implement batch flush support for existing sockets > iostream: Configure batch flushes if sink can do it > net: Added remote address accessors > circleci: update the image to CircleCI "standard" image > build: do not add header check target if no headers to check > build: pass target name to seastar_check_self_contained > build: detect glibc features using CMake > build: extract bits checking libc into CheckLibc.cmake > http/exception: add formatter for httpd::base_exception > http/client: Mark write_body() const > http/client: Introduce request::_bytes_written > http/client: Mark maybe_wait_for_continue() const > http/client: Mark send_request_head() const > http/client: Detach setup_request() > http/api_docs: copy in api_docs's copy constructor > script: do not inherit from object > scripts: addr2line: change StdinBacktraceIterator to a function > scripts: addr2line: use yield instead defining a class > tests: skip tests that require backtrace if execinfo.h is not found > backtrace: check for existence of execinfo.h > core: use ino_t and off_t as glibc sets these to 64bit if 64bit api is used > core: add sleep_abortable instantiation for manual_clock > tls: Return EPIPE exception when writing to shutdown socket > http/client: Don't cache connection if server advertises it > http/client: Mark connection as "keep in cache" > core: fix strerror_r usage from glibc extension > reactor: access sigevent.sigev_notify_thread_id with a macro > posix: use pthread_setaffinity_np instead of pthread_attr_setaffinity_np > reactor: replace __mode_t with mode_t > reactor: change sys/poll.h to posix poll.h > rpc: Add unit test for per-domain metrics > rpc: Report client connections metrics > rpc: Count dead client stats > rpc: Add seastar::rpc::metrics > rpc: Make public queues length getters io-scheduler fixes refs: scylladb#15312 refs: scylladb#11805 http client fixes refs: scylladb#13736 refs: scylladb#15509 rpc fixes refs: scylladb#15462
* seastar bab1625c...17183ed4 (73): > thread_pool: Reference reactor, not point to > sstring: inherit publicly from string_view formatter > circleci: use conditional steps > weak_ptr: include used header > build: disable the -Wunused-* warnings for checkheaders > resource: move variable into smaller lexical scope > resource: use structured binding when appropriate > httpd: Added server and client addresses to request structure > io_queue: do not dereference moved-away shared pointer > treewide: explicitly define ctor and assignment operator > memory: use `err` for the error string > doc: Add document describing all the math behind IO scheduler > io_queue: Add flow-rate based self slowdown backlink > io_queue: Make main throttler uncapped > io_queue: Add queue-wide metrics > io_queue: Introduce "flow monitor" > io_queue: Count total number of dispatched and completed requests so far > io_queue: Introduce io_group::io_latency_goal() > tests: test the vector overload for when_all_succeed > core: add a vector overload to when_all_succeed > loop: Fix iterator_range_estimate_vector_capacity for random iters > loop: Add test for iterator_range_estimate_vector_capacity > core/posix return old behaviour using non-portable pthread_attr_setaffinity_np when present > memory: s/throw()/noexcept/ > build: enable -Wdeprecated compiler option > reactor: mark kernel_completion's dtor protected > tests: always wait for promise > http, json, net: define-generated copy ctor for polymorphic types > treewide: do not define constexpr static out-of-line > reactor: do not define dtor of kernel_completion > http/exception: stop using dynamic exception specification > metrics: replace vector with deque > metrics: change metadata vector to deque > utils/backtrace.hh: make simple_backtrace formattable > reactor: Unfriend disk_config_params > reactor: Move add_to_flush_poller() to internal namespace > reactor: Unfriend a bunch of sched group template calls > rpc_test: Test rpc send glitches > 
net: Implement batch flush support for existing sockets > iostream: Configure batch flushes if sink can do it > net: Added remote address accessors > circleci: update the image to CircleCI "standard" image > build: do not add header check target if no headers to check > build: pass target name to seastar_check_self_contained > build: detect glibc features using CMake > build: extract bits checking libc into CheckLibc.cmake > http/exception: add formatter for httpd::base_exception > http/client: Mark write_body() const > http/client: Introduce request::_bytes_written > http/client: Mark maybe_wait_for_continue() const > http/client: Mark send_request_head() const > http/client: Detach setup_request() > http/api_docs: copy in api_docs's copy constructor > script: do not inherit from object > scripts: addr2line: change StdinBacktraceIterator to a function > scripts: addr2line: use yield instead defining a class > tests: skip tests that require backtrace if execinfo.h is not found > backtrace: check for existence of execinfo.h > core: use ino_t and off_t as glibc sets these to 64bit if 64bit api is used > core: add sleep_abortable instantiation for manual_clock > tls: Return EPIPE exception when writing to shutdown socket > http/client: Don't cache connection if server advertises it > http/client: Mark connection as "keep in cache" > core: fix strerror_r usage from glibc extension > reactor: access sigevent.sigev_notify_thread_id with a macro > posix: use pthread_setaffinity_np instead of pthread_attr_setaffinity_np > reactor: replace __mode_t with mode_t > reactor: change sys/poll.h to posix poll.h > rpc: Add unit test for per-domain metrics > rpc: Report client connections metrics > rpc: Count dead client stats > rpc: Add seastar::rpc::metrics > rpc: Make public queues length getters io-scheduler fixes refs: #15312 refs: #11805 http client fixes refs: #13736 refs: #15509 rpc fixes refs: #15462 Closes #15774
@kostja, @kbr-scylla seastar_io_queue_delay even higher: but Cassandra stress has no issue: IOException
|
Thanks @temichus. I'm aware that it's still failing. There are still some optimizations that have to be done inside The biggest issue was fixed (#15622) so it shouldn't have a large impact in real-world scenarios, where schema changes are rare. |
@kostja I can tell you right now that it won't be done for 6.0. |
The part we wanted for 6.0 is in, so thank you. |
Issue description
A new longevity test that runs only the new add/drop column nemesis. PR: scylladb/scylla-cluster-tests#6530
longevity test config:
Impact
After about 2 hours of running, seastar_io_queue_delay is more than 1 s and the client gets an error:
according to @gleb-cloudius "It should be million times smaller"
How frequently does it reproduce?
In 5 runs on i4i.xlarge nodes: 4 failures, 1 pass.
seastar_io_queue_delay in the passed run (4bc31d25-f6c4-4c39-95f1-b0f5e95db8b9)
In runs on i3en.large: 100% failures (according to @gleb-cloudius: "The instance has small memory (16G) so it probably flushes memtables a lot.")
Comments from email correspondence with developers
@xemul:
"There's a known issue found in the scheduler (seastar#1641, and a
workaround discussion at seastar#1766) so they could be better. Starvation
time is too large, but it can be due to that issue."
Installation details
Kernel Version: 5.15.0-1043-aws
Scylla version (or git commit hash):
5.4.0~dev-20230904.9a3d57256a60
with build-id 4c60ca8c4afe862855242a77b77b35a377e373dc
Cluster size: 3 nodes (i4i.xlarge)
Scylla Nodes used in this run:
OS / Image:
ami-09a5684fbf0955c84
(aws: undefined_region)
Test:
longevity-schema-changes-3h-test
Test id:
527c69a1-d03d-42ba-9b38-eba7b4e782c8
Test name:
scylla-staging/artsiom_mishuta/longevity-schema-changes-3h-test
Test config file(s):
Logs and commands
$ hydra investigate show-monitor 527c69a1-d03d-42ba-9b38-eba7b4e782c8
$ hydra investigate show-logs 527c69a1-d03d-42ba-9b38-eba7b4e782c8
Logs:
Jenkins job URL
Argus