coredump during c-s read performance test #2021
The system is still alive and marked as keep=alive in case someone wants to drill down. |
Please copy the core somewhere instead.
|
Hi Roy,
Instructions for core dumps are here, I think:
https://github.com/scylladb/scylla/wiki/How-to-report-a-Scylla-problem
|
Already uploaded it. |
@roydahan How do I extract the core file from the fragments? I joined the 4 fragments using cat but pigz refuses to decompress the result:
|
Better backtrace:
Looks like later() throws an exception, probably because task allocation fails. We abort because region_group::release_requests() is noexcept. It has to be noexcept since it's called on the free path; here, it is called from a reclaimer. |
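A minimal illustration of why this shows up as an abort rather than an exception (hypothetical stand-ins, not Scylla's actual code): an exception escaping a noexcept function calls std::terminate().

```cpp
#include <cstdio>
#include <new>

// Hypothetical stand-in for later(): scheduling a task requires an allocation,
// which can throw std::bad_alloc when memory is exhausted.
void schedule_task() {
    throw std::bad_alloc();
}

// Analogous to region_group::release_requests(): it must be noexcept because it
// is called on the free path / from a reclaimer. Any exception escaping a
// noexcept function calls std::terminate(), which appears as an abort/coredump.
void release_requests() noexcept {
    schedule_task();    // throws -> std::terminate()
}

int main() {
    std::printf("about to call a noexcept function that throws\n");
    release_requests(); // the program aborts here
}
```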
So, we're likely low on memory when this happens. Do we have logs/metrics for this run? |
Allocating from a reclaimer is dangerous, since there is no guarantee that the allocator will be able to satisfy the allocation without reclaiming (and we can't re-enter the reclaimer). This can happen, for example, if the small pool for the allocation is out of spare objects and needs to do a large allocation to repopulate. I think the code calling later() should be refactored to use alloc-free waking instead. |
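One way alloc-free waking could look, sketched with Boost.Intrusive (names here are hypothetical, not the actual Seastar/Scylla API): each blocked request carries its own list hook, so linking it into a wait list and waking it later is just pointer manipulation and is safe to run from a reclaimer.

```cpp
#include <boost/intrusive/list.hpp>

// Hypothetical sketch: each blocked request owns its own list hook, so linking
// it into the wait list and waking it later needs no dynamic allocation.
struct blocked_request : boost::intrusive::list_base_hook<> {
    bool ready = false;
    void wake() noexcept { ready = true; }   // flip a flag; the owning fiber polls it
};

class waiter_list {
    boost::intrusive::list<blocked_request> _waiters;  // non-owning, no allocation
public:
    void add(blocked_request& r) noexcept { _waiters.push_back(r); }
    // Safe to call from a reclaimer: only unlinks nodes and flips flags.
    void wake_all() noexcept {
        while (!_waiters.empty()) {
            auto& r = _waiters.front();
            _waiters.pop_front();
            r.wake();
        }
    }
};
```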
We should have the Prometheus DB in the Jenkins job. I can upload it later if you need it. |
@roydahan Please do, or just tell me where to find it. |
Prometheus data attached below: |
BTW, two more observations:
|
@roydahan Are those metrics from the same run? They end at 14:24 whereas from the log excerpt you pasted it looks like the crash was at 15:02. I also do not see the IP from the log (54.173.233.146) in the instance set from the metrics recording: |
Sorry @tgrabiec, you are correct, it's not from the same run but one run earlier. |
I finally got the core dump. The bug is triggered directly when the pending task buffer needs to grow above 64K elements, memory reclamation is needed for that allocation, and the reclamation is satisfied by compacting memtable segments. The bug exists on 1.3, but it could be that 1.5 made those conditions more likely to occur for some reason. Because the reclaimer tries to schedule a task using later(), it is bound to fail, since circular_buffer::maybe_expand() is re-entered. There is enough free memory otherwise (200MB), but only in <= 256K chunks. Action items:
|
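For illustration, a simplified model of the re-entrancy described above (hypothetical types, not the real Seastar circular_buffer): growing the queue triggers a reclaim callback, and if that callback tries to schedule work on the same queue, it re-enters the grow path while the queue is mid-expansion.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical task queue whose growth can invoke a "reclaimer" callback,
// mimicking how an allocation inside circular_buffer::maybe_expand() can
// trigger memory reclamation.
class task_queue {
    std::vector<std::function<void()>> _tasks;
    std::function<void()> _reclaimer;
    bool _expanding = false;
public:
    task_queue() { _tasks.reserve(8); }   // small initial capacity
    void set_reclaimer(std::function<void()> r) { _reclaimer = std::move(r); }
    void push(std::function<void()> t) {
        if (_tasks.size() == _tasks.capacity()) {
            // Re-entering the grow path from the reclaimer is not supported.
            assert(!_expanding && "push() re-entered while the queue is expanding");
            _expanding = true;
            if (_reclaimer) {
                _reclaimer();   // may call push() again -> re-entrancy bug
            }
            _tasks.reserve(_tasks.capacity() * 2);
            _expanding = false;
        }
        _tasks.push_back(std::move(t));
    }
};

int main() {
    task_queue q;
    // A reclaimer that schedules a task, like the region_group reclaimer calling later().
    q.set_reclaimer([&q] { q.push([] { /* release blocked requests */ }); });
    for (int i = 0; i < 100; ++i) {
        q.push([] {});   // once the queue must grow, the reclaimer re-enters push() and we assert
    }
}
```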
There could be another bug lurking here, which is causing a large number of tasks to be scheduled in the first place. I will dig a bit more. |
The pending task queue looks like this:
One problem is that there are a lot of tasks associated with
We shouldn't have as many as 16K pending writes at a time given the loader concurrency, so I presume that the writes started to time out and accumulate. The fix for this (5ea235e) was not yet in the 1.5 branch. @roydahan Do we have Prometheus metrics for this run? Another potential problem is that we call |
There are traces of write timeouts on this node:
|
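A hedged sketch of the idea behind dropping timed-out writes from the queue (hypothetical names, not the actual change in 5ea235e): check the request's deadline before executing it, so writes whose clients have already given up are failed immediately instead of piling up.

```cpp
#include <chrono>
#include <deque>
#include <functional>

using clock_type = std::chrono::steady_clock;

// Hypothetical queued write: a deadline plus the work to perform.
struct pending_write {
    clock_type::time_point deadline;
    std::function<void()> execute;
    std::function<void()> fail_timed_out;
};

// Drain the queue, dropping requests whose client has already given up,
// so timed-out writes do not keep consuming CPU and queue space.
inline void run_pending(std::deque<pending_write>& queue) {
    while (!queue.empty()) {
        pending_write w = std::move(queue.front());
        queue.pop_front();
        if (clock_type::now() > w.deadline) {
            w.fail_timed_out();   // report the timeout without doing the write
        } else {
            w.execute();
        }
    }
}
```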
32k writes, even if each write is 1k, should still not overwhelm a shard. It's just 32M. We need to improve in this area, but I don't think this is the root cause here. |
@tgrabiec a possible improvement for the task queue is to allocate a fixed-size vector and also support a singly-linked list as part of the task structure. Try to enqueue to the fixed ring buffer; if that fails, use the linked list. This gives good performance when not stressed, and allocation-free operation when stressed. |
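A sketch of that proposal under the stated assumptions (types and names are illustrative, not Seastar's actual task queue): a fixed-size ring for the common case, with an intrusive next pointer in the task itself as an allocation-free overflow path.

```cpp
#include <array>
#include <cstddef>

// Illustrative task type: the overflow link is part of the task itself,
// so overflow enqueueing needs no allocation.
struct task {
    void (*run)(task*) = nullptr;
    task* next = nullptr;
};

template <std::size_t N>
class hybrid_task_queue {
    std::array<task*, N> _ring{};
    std::size_t _head = 0, _tail = 0, _count = 0;
    task* _overflow_head = nullptr;   // singly-linked overflow list
    task* _overflow_tail = nullptr;
public:
    // Never allocates: fast ring path when not stressed, intrusive list when full.
    void push(task* t) noexcept {
        // Once overflow has started, keep using it so FIFO order is preserved.
        if (_count < N && !_overflow_head) {
            _ring[_tail] = t;
            _tail = (_tail + 1) % N;
            ++_count;
        } else {
            t->next = nullptr;
            if (_overflow_tail) {
                _overflow_tail->next = t;
            } else {
                _overflow_head = t;
            }
            _overflow_tail = t;
        }
    }
    task* pop() noexcept {
        if (_count > 0) {             // ring entries are the oldest
            task* t = _ring[_head];
            _head = (_head + 1) % N;
            --_count;
            return t;
        }
        if (_overflow_head) {         // then drain the overflow list
            task* t = _overflow_head;
            _overflow_head = t->next;
            if (!_overflow_head) {
                _overflow_tail = nullptr;
            }
            return t;
        }
        return nullptr;
    }
};
```

With this scheme, overflow tasks are linked only through memory the task already owns, so the enqueue path stays noexcept even under memory pressure.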
@avikivity The fact that we have 32K tasks means that writes are timing out, since the client concurrency is way lower than that. When we reach 32K tasks and there are writes blocked on dirty, we hit the reclaimer bug, resulting in a crash. But this bug has been there since 1.3, so this suggests that on earlier versions we had less blocking of writes (and/or fewer pending tasks due to fewer timeouts). I'm still investigating why writes are blocking so much. So far the problem seems to be related to the CPU being saturated. |
On previous versions the flush logic for memtables was different, so we probably flushed "earlier" in the case of a single keyspace.table.
|
Before, the logic for releasing writes blocked on dirty worked like this:
1) When the region group size changes, it is not under pressure, and there are some requests blocked, schedule a request releasing task.
2) The request releasing task, if there is no pressure, runs one request and, if there are still blocked requests, schedules the next request releasing task.

If requests don't change the size of the region group, then either some request executes or there is a request releasing task scheduled. The number of scheduled tasks is at most 1; there is a single thread of execution. However, if the requests themselves change the size of the group, then each such change schedules yet another request releasing thread, growing the task queue by one. The group size can also change when memory is reclaimed from the groups (e.g. when it contains sparse segments). Compaction may start many request releasing threads due to group size updates.

Such behavior is detrimental to performance and stability if there are a lot of blocked requests. This can happen on 1.5 even with modest concurrency because timed-out requests stay in the queue. This is less likely on 1.6, where they are dropped from the queue. The releasing of tasks may start to dominate over other processes in the system. When the number of scheduled tasks reaches 1000, polling stops and the server becomes unresponsive until all of the released requests are done, which is either when they start to block on dirty memory again or when there are no blocked requests left. It may take a while to reach the pressure condition after a memtable flush if it brings virtual dirty much below the threshold, which is currently the case for workloads with overwrites producing sparse regions. Refs #2021.

Fix by ensuring there is at most one request releasing thread at a time. There will be one releasing fiber per region group which is woken up when pressure is lifted. It executes blocked requests until pressure occurs.
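A simplified model of the fix described in this commit message, using std::thread and std::condition_variable in place of a Seastar fiber (names are illustrative, not Scylla's actual classes): a single releasing loop per group sleeps until pressure is lifted and then drains blocked requests, instead of scheduling one task per size change.

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical sketch: one releasing "fiber" per group, woken on relief.
class region_group_model {
    std::mutex _mtx;
    std::condition_variable _relief;
    std::deque<std::function<void()>> _blocked;
    bool _under_pressure = true;
    bool _stopping = false;
    std::thread _releaser;

    void release_loop() {
        std::unique_lock<std::mutex> lk(_mtx);
        while (!_stopping) {
            // Sleep until pressure is lifted and there is work, or we are stopping.
            _relief.wait(lk, [this] {
                return _stopping || (!_under_pressure && !_blocked.empty());
            });
            // Run blocked requests until pressure returns or the queue drains.
            while (!_stopping && !_under_pressure && !_blocked.empty()) {
                auto req = std::move(_blocked.front());
                _blocked.pop_front();
                lk.unlock();
                req();              // executing the request may dirty memory again
                lk.lock();
            }
        }
    }
public:
    region_group_model() : _releaser([this] { release_loop(); }) {}
    ~region_group_model() {
        { std::lock_guard<std::mutex> lk(_mtx); _stopping = true; }
        _relief.notify_all();
        _releaser.join();
    }
    void block(std::function<void()> req) {
        std::lock_guard<std::mutex> lk(_mtx);
        _blocked.push_back(std::move(req));
    }
    // Called when the group size drops below the threshold (pressure is lifted).
    void notify_relief() {
        { std::lock_guard<std::mutex> lk(_mtx); _under_pressure = false; }
        _relief.notify_one();
    }
    // Called when the group goes back over the threshold.
    void notify_pressure() {
        std::lock_guard<std::mutex> lk(_mtx);
        _under_pressure = true;
    }
};
```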
The logic for notification across the hierarchy was replaced by calling region_group::notify_relief() from region_group::update() on the broadest relieved group.
"Before, the logic for releasing writes blocked on dirty worked like this: 1) When region group size changes and it is not under pressure and there are some requests blocked, then schedule request releasing task 2) request releasing task, if no pressure, runs one request and if there are still blocked requests, schedules next request releasing task If requests don't change the size of the region group, then either some request executes or there is a request releasing task scheduled. The amount of scheduled tasks is at most 1, there is a single releasing thread. However, if requests themselves would change the size of the group, then each such change would schedule yet another request releasing thread, growing the task queue size by one. The group size can also change when memory is reclaimed from the groups (e.g. when contains sparse segments). Compaction may start many request releasing threads due to group size updates. Such behavior is detrimental for performance and stability if there are a lot of blocked requests. This can happen on 1.5 even with modest concurrency because timed out requests stay in the queue. This is less likely on 1.6 where they are dropped from the queue. The releasing of tasks may start to dominate over other processes in the system. When the amount of scheduled tasks reaches 1000, polling stops and server becomes unresponsive until all of the released requests are done, which is either when they start to block on dirty memory again or run out of blocked requests. It may take a while to reach pressure condition after memtable flush if it brings virtual dirty much below the threshold, which is currently the case for workloads with overwrites producing sparse regions. I saw this happening in a write workload from issue #2021 where the number of request releasing threads grew into thousands. Fix by ensuring there is at most one request releasing thread at a time. There will be one releasing fiber per region group which is woken up when pressure is lifted. It executes blocked requests until pressure occurs." * tag 'tgrabiec/lsa-single-threaded-releasing-v2' of github.com:cloudius-systems/seastar-dev: tests: lsa: Add test for reclaimer starting and stopping tests: lsa: Add request releasing stress test lsa: Avoid avalanche releasing of requests lsa: Move definitions to .cc lsa: Simplify hard pressure notification management lsa: Do not start or stop reclaiming on hard pressure tests: lsa: Adjust to take into account that reclaimers are run synchronously lsa: Document and annotate reclaimer notification callbacks tests: lsa: Use with_timeout() in quiesce()
"Before, the logic for releasing writes blocked on dirty worked like this: 1) When region group size changes and it is not under pressure and there are some requests blocked, then schedule request releasing task 2) request releasing task, if no pressure, runs one request and if there are still blocked requests, schedules next request releasing task If requests don't change the size of the region group, then either some request executes or there is a request releasing task scheduled. The amount of scheduled tasks is at most 1, there is a single releasing thread. However, if requests themselves would change the size of the group, then each such change would schedule yet another request releasing thread, growing the task queue size by one. The group size can also change when memory is reclaimed from the groups (e.g. when contains sparse segments). Compaction may start many request releasing threads due to group size updates. Such behavior is detrimental for performance and stability if there are a lot of blocked requests. This can happen on 1.5 even with modest concurrency because timed out requests stay in the queue. This is less likely on 1.6 where they are dropped from the queue. The releasing of tasks may start to dominate over other processes in the system. When the amount of scheduled tasks reaches 1000, polling stops and server becomes unresponsive until all of the released requests are done, which is either when they start to block on dirty memory again or run out of blocked requests. It may take a while to reach pressure condition after memtable flush if it brings virtual dirty much below the threshold, which is currently the case for workloads with overwrites producing sparse regions. I saw this happening in a write workload from issue #2021 where the number of request releasing threads grew into thousands. Fix by ensuring there is at most one request releasing thread at a time. There will be one releasing fiber per region group which is woken up when pressure is lifted. It executes blocked requests until pressure occurs." * tag 'tgrabiec/lsa-single-threaded-releasing-v2' of github.com:cloudius-systems/seastar-dev: tests: lsa: Add test for reclaimer starting and stopping tests: lsa: Add request releasing stress test lsa: Avoid avalanche releasing of requests lsa: Move definitions to .cc lsa: Simplify hard pressure notification management lsa: Do not start or stop reclaiming on hard pressure tests: lsa: Adjust to take into account that reclaimers are run synchronously lsa: Document and annotate reclaimer notification callbacks tests: lsa: Use with_timeout() in quiesce() (cherry picked from commit 7a00dd6)
"Before, the logic for releasing writes blocked on dirty worked like this: 1) When region group size changes and it is not under pressure and there are some requests blocked, then schedule request releasing task 2) request releasing task, if no pressure, runs one request and if there are still blocked requests, schedules next request releasing task If requests don't change the size of the region group, then either some request executes or there is a request releasing task scheduled. The amount of scheduled tasks is at most 1, there is a single releasing thread. However, if requests themselves would change the size of the group, then each such change would schedule yet another request releasing thread, growing the task queue size by one. The group size can also change when memory is reclaimed from the groups (e.g. when contains sparse segments). Compaction may start many request releasing threads due to group size updates. Such behavior is detrimental for performance and stability if there are a lot of blocked requests. This can happen on 1.5 even with modest concurrency because timed out requests stay in the queue. This is less likely on 1.6 where they are dropped from the queue. The releasing of tasks may start to dominate over other processes in the system. When the amount of scheduled tasks reaches 1000, polling stops and server becomes unresponsive until all of the released requests are done, which is either when they start to block on dirty memory again or run out of blocked requests. It may take a while to reach pressure condition after memtable flush if it brings virtual dirty much below the threshold, which is currently the case for workloads with overwrites producing sparse regions. I saw this happening in a write workload from issue #2021 where the number of request releasing threads grew into thousands. Fix by ensuring there is at most one request releasing thread at a time. There will be one releasing fiber per region group which is woken up when pressure is lifted. It executes blocked requests until pressure occurs." * tag 'tgrabiec/lsa-single-threaded-releasing-v2' of github.com:cloudius-systems/seastar-dev: tests: lsa: Add test for reclaimer starting and stopping tests: lsa: Add request releasing stress test lsa: Avoid avalanche releasing of requests lsa: Move definitions to .cc lsa: Simplify hard pressure notification management lsa: Do not start or stop reclaiming on hard pressure tests: lsa: Adjust to take into account that reclaimers are run synchronously lsa: Document and annotate reclaimer notification callbacks tests: lsa: Use with_timeout() in quiesce() (cherry picked from commit 7a00dd6)
"Before, the logic for releasing writes blocked on dirty worked like this: 1) When region group size changes and it is not under pressure and there are some requests blocked, then schedule request releasing task 2) request releasing task, if no pressure, runs one request and if there are still blocked requests, schedules next request releasing task If requests don't change the size of the region group, then either some request executes or there is a request releasing task scheduled. The amount of scheduled tasks is at most 1, there is a single releasing thread. However, if requests themselves would change the size of the group, then each such change would schedule yet another request releasing thread, growing the task queue size by one. The group size can also change when memory is reclaimed from the groups (e.g. when contains sparse segments). Compaction may start many request releasing threads due to group size updates. Such behavior is detrimental for performance and stability if there are a lot of blocked requests. This can happen on 1.5 even with modest concurrency because timed out requests stay in the queue. This is less likely on 1.6 where they are dropped from the queue. The releasing of tasks may start to dominate over other processes in the system. When the amount of scheduled tasks reaches 1000, polling stops and server becomes unresponsive until all of the released requests are done, which is either when they start to block on dirty memory again or run out of blocked requests. It may take a while to reach pressure condition after memtable flush if it brings virtual dirty much below the threshold, which is currently the case for workloads with overwrites producing sparse regions. I saw this happening in a write workload from issue #2021 where the number of request releasing threads grew into thousands. Fix by ensuring there is at most one request releasing thread at a time. There will be one releasing fiber per region group which is woken up when pressure is lifted. It executes blocked requests until pressure occurs." * tag 'tgrabiec/lsa-single-threaded-releasing-v2' of github.com:cloudius-systems/seastar-dev: tests: lsa: Add test for reclaimer starting and stopping tests: lsa: Add request releasing stress test lsa: Avoid avalanche releasing of requests lsa: Move definitions to .cc lsa: Simplify hard pressure notification management lsa: Do not start or stop reclaiming on hard pressure tests: lsa: Adjust to take into account that reclaimers are run synchronously lsa: Document and annotate reclaimer notification callbacks tests: lsa: Use with_timeout() in quiesce() (cherry picked from commit 7a00dd6)
@tgrabiec can we close this issue? |
I'll verify that when running 1.7 perf testing. |
Verified on 1.7rc2; it seems the issue is fixed. |
Installation details
Scylla version (or git commit hash): 1.5.0
Cluster size: 3 nodes, i2.2xlarge
OS (RHEL/CentOS/Ubuntu/AWS AMI): AWS AMI (ami-0d73641a)
During ami-perf-regression testing, a coredump was produced in the test-read scenario.
The test uses 4 loaders and runs c-s read for 50 minutes; however, the coredump was produced during the write phase of the test, after about 1 hour.
Test Details:
First phase - Write:
cassandra-stress write no-warmup cl=QUORUM n=30000000 -schema keyspace=keyspace$2 'replication(factor=3)' -port jmx=6868 -mode cql3 native -rate threads=500 -errors ignore -pop seq=1..30000000
Second phase - Read:
cassandra-stress read no-warmup cl=QUORUM duration=50m -schema keyspace=keyspace$2 'replication(factor=3)' -port jmx=6868 -mode cql3 native -rate threads=500 -errors ignore -pop 'dist=gauss(1..30000000,15000000,1500000)'
Coredump location:
The coredump is 56GB, compressed with pigz and split into 4GB chunks:
scylladb-users-upload.s3.amazonaws.com/test-read-1.5/xaa
scylladb-users-upload.s3.amazonaws.com/test-read-1.5/xab
scylladb-users-upload.s3.amazonaws.com/test-read-1.5/xac
scylladb-users-upload.s3.amazonaws.com/test-read-1.5/xad
Relevant section from the log: