
Optimize bulk deletion of exclusive queues when connection is closed/lost #1566

Closed
michaelklishin opened this issue Mar 28, 2018 · 4 comments

@michaelklishin (Member) commented Mar 28, 2018

#1513 covers a case where a cluster member that hosts a lot of queues (exclusive or otherwise ready for deletion) stops, which causes very high lock contention because of how the records and stats were cleaned up.

We still have many things to optimize in the case where a connection that owned a large number of queues (my tests were done with about 100K) closes. According to Observer, metrics collectors are by far the busiest processes. Some of the work we did in #1526 should apply to this case as well, e.g. metrics and record cleanup can be batched and use fewer internal DB queries.
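
For illustration, batching here means collapsing N per-record transactions into one. A minimal Mnesia sketch of the idea, assuming a hypothetical `queue_stats` table keyed by queue name (not RabbitMQ's actual schema):

```erlang
%% Hypothetical table for illustration; not RabbitMQ's actual schema.
%% One transaction per queue: N commit rounds, N lock acquisitions.
delete_stats_one_by_one(QNames) ->
    [mnesia:transaction(fun() -> mnesia:delete({queue_stats, Q}) end)
     || Q <- QNames].

%% One transaction for the whole batch: a single commit round and far
%% fewer internal DB queries for the same set of deletions.
delete_stats_batched(QNames) ->
    mnesia:transaction(
      fun() -> [mnesia:delete({queue_stats, Q}) || Q <- QNames] end).
```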

@hairyhum (Contributor) commented May 2, 2018

I did some measurements. With 100K queues it takes around 54 seconds (without any workload) to remove them on 8 cores. If we skip stats cleanup, it takes 53 seconds; without removing data from Mnesia, around 13 seconds; and without cleaning up bindings, also around 13 seconds. So the main Mnesia contention happens on bindings cleanup and cannot be optimised by batching the operations. It can be optimised in the bindings cleanup code itself, though, and that will benefit any queue deletion.

hairyhum added a commit that referenced this issue May 4, 2018

Do not set table-wide and partial locks when deleting bindings.
Table-wide locks are supposed to improve performance for bulk operations
on a single entity, but they block operations on multiple entities, for
example exclusive queues.
Partial locks have been there since forever without any clear justification,
and I don't see any good reasons to keep them. The delete operation will
acquire a lock on the actual record, avoiding partial table locks.

Addresses #1566
[#156352963]
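
A hedged sketch of the difference this commit describes, with a simplified table shape rather than the real `rabbit_route` schema:

```erlang
%% Before: a table-wide write lock blocks every other writer, including
%% deletions for unrelated queues.
delete_with_table_lock(Tab, Key) ->
    mnesia:transaction(fun() ->
        mnesia:lock({table, Tab}, write),
        mnesia:delete({Tab, Key})
    end).

%% After: mnesia:delete/3 with the 'write' lock kind locks only this
%% record, so deletions for unrelated entities can proceed in parallel.
delete_with_record_lock(Tab, Key) ->
    mnesia:transaction(fun() ->
        mnesia:delete(Tab, Key, write)
    end).
```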

hairyhum added a commit that referenced this issue May 14, 2018

Do not lock the entire routing table when cleaning up bindings.
Instead of locking the entire table we can use a custom global lock on
the affected resource (source or destination).
This can improve performance when multiple records are deleted at the
same time, for example when a connection with exclusive queues closes.
The resource lock is also acquired when adding or removing a binding,
so it won't conflict with bulk removal.

Addresses #1566
[#156352963]
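
A sketch of what such a per-resource lock can look like; the table name `rabbit_route` matches the commit's context, but the exact lock key shape is an assumption:

```erlang
%% Locks an arbitrary term (neither a record nor a table) on the nodes
%% that hold copies of the table. Two transactions touching the same
%% source or destination serialize on this key; everything else runs
%% concurrently, unlike with a table-wide lock.
lock_resource(Resource) ->
    Nodes = mnesia:table_info(rabbit_route, where_to_write),
    mnesia:lock({global, Resource, Nodes}, write).
```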

hairyhum added a commit that referenced this issue May 14, 2018

Do dirty deletes when cleaning up bindings.
Dirty deletes are faster and idempotent, which means
they can be run in a transaction as long as the resource is locked
at the beginning of the transaction, which is done in
`lock_resource`.
The speed improvement comes from not setting a record lock
for each record, since we already hold the resource lock.

Addresses #1566
[#156352963]
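
A sketch of the combination, with illustrative names (the real code lives in the binding cleanup path). Note that dirty writes are executed immediately and are not undone if the transaction aborts, which is part of what made this approach fragile (see the August comment below):

```erlang
%% Illustrative only. The up-front global lock on the destination is
%% what makes skipping per-record locks tolerable here.
remove_routes(RouteKeys, Dst) ->
    mnesia:transaction(fun() ->
        Nodes = mnesia:table_info(rabbit_route, where_to_write),
        mnesia:lock({global, Dst, Nodes}, write),
        %% Dirty deletes skip per-record locking, but they also bypass
        %% transaction rollback: an abort will not restore these rows.
        [mnesia:dirty_delete(rabbit_route, K) || K <- RouteKeys]
    end).
```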
@michaelklishin (Member, Author) commented May 15, 2018

Done in #1589 with some non-trivial gains.

We will see if this can be backported to 3.6.x.

@michaelklishin michaelklishin modified the milestones: 3.6.16, 3.7.6 May 15, 2018

michaelklishin added a commit that referenced this issue May 16, 2018

Do not lock the entire routing table when cleaning up bindings.
Instead of locking the entire table we can use a custom global lock on
the affected resource (source or destination).
This can improve performance when multiple records are deleted at the
same time, for example when a connection with exclusive queues closes.
The resource lock is also acquired when adding or removing a binding,
so it won't conflict with bulk removal.

Addresses #1566
[#156352963]

(cherry picked from commit 5cdee15)

michaelklishin added a commit that referenced this issue May 16, 2018

Do dirty deletes when cleaning up bindings.
Dirty deletes are faster and idempotent, which means
they can be run in a transaction as long as the resource is locked
at the beginning of the transaction, which is done in
`lock_resource`.
The speed improvement comes from not setting a record lock
for each record, since we already hold the resource lock.

Addresses #1566
[#156352963]

(cherry picked from commit 7705d4e)
@michaelklishin (Member, Author) commented Aug 30, 2018

Unfortunately we will have to undo the changes that make most of the difference in #1589, as they can lead to inconsistent views of the binding tables: transactions that perform removal are not idempotent, and this can lead to situations where a rolled back transaction "restores" all bindings deleted by the exchange type module, but its retry doesn't find any bindings to delete, so there are bindings left around.

The least efficient part of queue deletion is binding deletion, because there is currently no way to load the bindings of a queue (or exchange) without a full scan of one of the binding tables. This unfortunate design can be mitigated with secondary indices, but that would be a breaking schema change and therefore can only go into 3.8.0.
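
To make the full-scan problem concrete, here is an illustrative stand-in for the binding table (not the real `rabbit_route` record): the destination is buried inside the composite key, so the only way to find a queue's bindings is to match the whole table. A top-level field with a secondary index would turn that into an index read, at the cost of a schema change:

```erlang
%% Simplified stand-in for illustration; not the actual schema.
-record(route, {key,            % {Source, Destination, RoutingKey}
                destination}).  % duplicated at top level so it is indexable

%% Today: a full-table match with everything but the destination wildcarded.
bindings_for_scan(Dst) ->
    mnesia:dirty_match_object(
      route, #route{key = {'_', Dst, '_'}, destination = '_'}).

%% With a secondary index on the destination field, e.g.
%%   mnesia:add_table_index(route, destination),
%% the same question becomes an index lookup:
bindings_for_indexed(Dst) ->
    mnesia:dirty_index_read(route, Dst, #route.destination).
```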

This was discovered during QA of rabbitmq/rabbitmq-consistent-hash-exchange#39 and is catastrophic for that particular exchange type, since it has state that depends on the state of bindings, and any inconsistency will break routing.

Per discussion with @dcorbacho @hairyhum @kjnilsson @dumbbell, kudos to @acogoluegnes for reproducing the issue in rabbitmq/rabbitmq-consistent-hash-exchange#39.

@michaelklishin (Member, Author) commented Feb 8, 2019

This had side effects on binding deletion consistency and was reverted.
