Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Database deadlock #200

Closed
xiaoronglv opened this issue Sep 17, 2015 · 17 comments
Closed

Database deadlock #200

xiaoronglv opened this issue Sep 17, 2015 · 17 comments

Comments

@xiaoronglv
Copy link

Hi , I encounter a problem about rpush, and try to fix it. But I can't find the root cause of

  1. Rpush version: 2.4.0
  2. rpush run in two host.
*** (1) TRANSACTION:
mysql tables in use 1, locked 1
LOCK WAIT 22 lock struct(s), heap size 3112, 240 row lock(s)
query id 2154897537 rpushhost1 Creating sort index
SELECT `rpush_notifications`.* FROM `rpush_notifications` WHERE (processing = 0 AND delivered = 0 AND failed = 0 AND (deliver_after IS NULL OR deliver_after < '2015-09-08 06:23:19')) ORDER BY created_at ASC LIMIT 100 FOR UPDATE

*** (2) TRANSACTION:
mysql tables in use 1, locked 1
3 lock struct(s), heap size 1248, 2 row lock(s), undo log entries 1
query id 2154897539 rpushhost2 updating
UPDATE `rpush_notifications` SET processing = 0, delivered = 1, delivered_at = '2015-09-08 06:23:19' WHERE `rpush_notifications`.`id` IN (47324, 47325, 47326)
*** WE ROLL BACK TRANSACTION (2)
@modx-space
Copy link

we also met this similar issue, here is the log:

LATEST DETECTED DEADLOCK
------------------------
2015-10-10 15:04:56 7f1aa365b700
*** (1) TRANSACTION:
TRANSACTION 20624196, ACTIVE 1 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 3 lock struct(s), heap size 376, 2 row lock(s), undo log entries 1
MySQL thread id 9452, OS thread handle 0x7f1aa193e700, query id 1025858 xxxx update
INSERT INTO `rpush_notifications` (`alert`, `app_id`, `badge`, `created_at`, `data`, `device_token`, `push_setting_id`, `registration_ids`, `sound`, `type`, `updated_at`, `url_args`) VALUES ('xx',...)
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 154827 page no 5 n bits 120 index `index_rpush_notifications_multi` of table `mp_prod`.`rpush_notifications` trx id 20624196 lock_mode X locks gap before rec insert intention waiting
Record lock, heap no 7 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
 0: len 1; hex 80; asc  ;;
 1: len 1; hex 81; asc  ;;
 2: len 4; hex 8000f7c1; asc     ;;

*** (2) TRANSACTION:
TRANSACTION 20624195, ACTIVE 2 sec fetching rows
mysql tables in use 1, locked 1
8 lock struct(s), heap size 1248, 101 row lock(s)
MySQL thread id 14610, OS thread handle 0x7f1aa365b700, query id 1025860 xxxx Searching rows for update
UPDATE `rpush_notifications` SET processing = 1 WHERE `rpush_notifications`.`id` IN (63459, 63461, 63463, 63465, 63467, 63489, 63491, 63493, 63495, 63497, 63499, 63501, 63503, 63505, 63507, 63509, 63469, 63471, 63473, 63475, 63477, 63479, 63481, 63483, 63485, 63487)
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 154827 page no 5 n bits 120 index `index_rpush_notifications_multi` of table `mp_prod`.`rpush_notifications` trx id 20624195 lock_mode X locks gap before rec
Record lock, heap no 7 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
 0: len 1; hex 80; asc  ;;
 1: len 1; hex 81; asc  ;;
 2: len 4; hex 8000f7c1; asc     ;;

*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 154827 page no 7 n bits 96 index `PRIMARY` of table `mp_prod`.`rpush_notifications` trx id 20624195 lock_mode X waiting
Record lock, heap no 29 PHYSICAL RECORD: n_fields 33; compact format; info bits 0
 0: len 4; hex 8000f817; asc     ;;
 1: len 6; hex 0000013ab344; asc    : D;;
 2: len 7; hex f4000001380110; asc     8  ;;
 3: len 4; hex 80000009; asc     ;;
 4: len 30; hex 666537316362326362613035616234373734336538336439653532613139; asc fe71cb2cba05ab47743e83d9e52a19; (total 64 bytes);
 5: len 7; hex 64656661756c74; asc default;;
 6: len 30; hex 4361726c61204772616e7420616464656420746f7069633a206e6f746573; asc Carla Grant added topic: notes; (total 40 bytes);
 7: len 30; hex 7b2270223a22632c536170706869726530342c636772616e7431222c2274; asc {"p":"c,Sapphire04,cgrant1","t; (total 99 bytes);
 8: len 4; hex 80015180; asc   Q ;;
 9: len 1; hex 80; asc  ;;
 10: SQL NULL;
 11: len 1; hex 80; asc  ;;
 12: SQL NULL;
 13: SQL NULL;
 14: SQL NULL;
 15: SQL NULL;
 16: len 5; hex 9997556136; asc   Ua6;;
 17: len 5; hex 9997556136; asc   Ua6;;
 18: len 1; hex 80; asc  ;;
 19: len 30; hex 52707573683a3a436c69656e743a3a4163746976655265636f72643a3a41; asc Rpush::Client::ActiveRecord::A; (total 47 bytes);
 20: SQL NULL;
 21: len 1; hex 80; asc  ;;
 22: SQL NULL;
 23: len 4; hex 80000001; asc     ;;
 24: len 4; hex 80000000; asc     ;;
 25: SQL NULL;
 26: SQL NULL;
 27: SQL NULL;
 28: len 1; hex 80; asc  ;;
 29: SQL NULL;
 30: SQL NULL;
 31: SQL NULL;
 32: len 4; hex 8000235a; asc   #Z;;

*** WE ROLL BACK TRANSACTION (1)

But we found that we had two rpush processes running on the host during the issue happened, so I am wondering whether rpush supports multiple instances running in the same environment?

@amaierhofer
Copy link
Contributor

We encountered similar Deadlock issues, which appear more frequently for a higher number of notifications, larger batch_size e.g. 2000 and short poll_interval e.g. 0.5. We are using mysql2 0.3.20.

Occasionally the recovering mechanism also fails and the daemon stops delivering messages all together. Sometimes the queue (as queried with rpush status) fills up, sometimes even the feeder thread seems to hang. In both cases no additional information is available in the log files.

It appears that introducing additional sleep time between statements in https://github.com/rpush/rpush/blob/master/lib/rpush/daemon/store/active_record/reconnectable.rb#L55 mitigates the problem but it still occurs occasionally.

Any ideas on who to debug this further? What settings work best for achieving high throughput for APNS on mysql without running the risk of Deadlocks?

@xiaoronglv
Copy link
Author

Hi @amaierhofer

I have reproduced in my localhost and fixed this problem by refactoring index.

Conflict transactions

Transaction 1

SELECT  `rpush_notifications`.* FROM `rpush_notifications` WHERE (processing = 0 AND delivered = 0 AND failed = 0 AND (deliver_after IS NULL OR deliver_after < '2015-08-21 16:48:10')) ORDER BY created_at ASC LIMIT 100 FOR UPDATE

Transaction 2

UPDATE `rpush_notifications` SET processing = 0, delivered = 1, delivered_at = '2015-08-21 16:48:10' WHERE `rpush_notifications`.`id` IN (647770)

Root cause

thumbnail

The counterintuitive part of transaction1 is that, In Repeated read isolate level, Innodb engine will lock all rows which satisfy 'failed=0 and delivered=0' , it don't care the following conditions.

  • limit 100;
  • processing=0
  • deliver_after IS NULL OR deliver_after < '2015-08-21 17:47:32'
  • ORDER BY created_at ASC

These conditions are filtered in MySQL sever level, not engine level.

thumbnail-1

https://github.com/rpush/rpush/blob/master/lib/rpush/daemon/store/active_record.rb#L30

https://github.com/rpush/rpush/blob/master/lib/rpush/daemon/store/active_record.rb#L95

How to reproduce this bug in your localhost. (run it in rails c)

Process A

loop do
    ActiveRecord::Base.connection.execute("SELECT `rpush_notifications`.* FROM `rpush_notifications` WHERE (processing = 0 AND delivered = 0 AND failed = 0 AND (deliver_after IS NULL OR deliver_after < '2015-10-21 17:47:32')) ORDER BY created_at ASC LIMIT 100 for update;")
end

Process B

Rpush::Notification.where(processing: true).find_each do |ntf|
id = ntf.id
ActiveRecord::Base.connection.execute("UPDATE `rpush_notifications` SET processing = 0, delivered = 1, delivered_at = '2015-08-21 16:48:10' WHERE `rpush_notifications`.`id` IN (#{id})")
sleep 0.5
end

Solution

Modify the following index to reduce lock rows when select * from * for update.

DROP INDEX `index_rpush_notifications_multi` ON rpush_notifications;
CREATE INDEX index_rpush_notifications_multi ON rpush_notifications (delivered, failed, processing, deliver_after)

@xiaoronglv
Copy link
Author

Deadlock will raise an error, then interrupt the thread 'Feeder'.

That's the reason why you rpush stop work.

@amaierhofer
Copy link
Contributor

Many thanks @xiaoronglv for sharing your insights! I can confirm, that this solution works!

Without adjusting the index, running the code from Process A and the rpush daemon at the same time, triggered plenty of deadlocks. After making the suggested adjustments, no more deadlocks occured.

The index_rpush_notifications_multi index as created in

add_index :rpush_notifications, [:delivered, :failed], name: 'index_rpush_notifications_multi', where: 'NOT delivered AND NOT failed'
should probably be adjusted, so new mysql deployments wont run into this issue at all. Also the partial index is not supported by mysql. @ileitch, what do you think?

@daniel-illi
Copy link
Contributor

@xiaoronglv, @amaierhofer What is the impact on the performance when including all attributes in the index instead of using a partial index?

@xiaoronglv
Copy link
Author

It's a balance about reuse / index size / read and write.

advantage

full Index covered query is more faster.

disadvantage

  1. index is too large.
  2. destroy/insert/update will be more slower(write)

@dzubi

@gsusmonzon
Copy link

Hello,

I followed a different approach.
In my setup changing the index didnt helped too much (the selectivity was still low and locked many rows).
However sorting by id instead of created_at saved from doing a 'filesort' and the query run faster.
It means monkey patching the ready_for_delivery method https://github.com/rpush/rpush/blob/master/lib/rpush/daemon/store/active_record.rb#L191 . But my approach is kind of dirty.

def ready_for_delivery
  relation = Rpush::Client::ActiveRecord::Notification.where('processing = ? AND delivered = ? AND failed = ? AND (deliver_after IS NULL OR deliver_after < ?)', false, false, false, Time.now)
  @using_oracle ? relation : relation.order('created_at ASC') #change order to `id ASC`
end

Hope it helps. But I guess that it only makes the deadlock less likely. If the .lock(true) method is used ( https://github.com/rpush/rpush/blob/master/lib/rpush/daemon/store/active_record.rb#L208 ) deadlocks cant be avoided completely.

@soulfly
Copy link

soulfly commented May 6, 2016

Hi there,

also fighting with similar strange issues

under the high load, we noticed that a lot of notifications remain in 'processing' state forever and the number of such notifications increases time to time

for example,

APNS processing: 429
GCM processing: 215
WPNS processing: 1

and we don't know why

also, we sometimes see the following in rpush.log:

[2016-05-06 14:01:27] [ERROR] ActiveRecord::StatementInvalid, Mysql2::Error: Deadlock found when trying to get lock; try restarting transaction: UPDATE rpush_notifications SET processing = 0, delivered = 1, delivered_at = '2016-05-06 14:01:27' WHERE rpush_notifications.id IN (13615806)

So looks like @xiaoronglv is right

We are going to try his solution, will let you know

We use the latest rpush 2.7.0

@modx-space
Copy link

Hi @soulfly , how about the solution? any updates?

@soulfly
Copy link

soulfly commented Sep 5, 2016

@shawzt looks like it works better,
we have less records in 'processing' state but anyway we still have it for some reason..

@modx-space
Copy link

@soulfly Good to know. I will apply the changes to the index in our new project.

@soulfly
Copy link

soulfly commented Sep 7, 2016

At the moment we have another version in-parallel of our project with Redis instead of ActiveRecord.
We don't have such issues there. Still testing Redis for more visibility.

@modx-space
Copy link

@soulfly , how long you have been running it with Redis?

@soulfly
Copy link

soulfly commented Sep 8, 2016

@shawzt it's already ~3 or 4 months up and running

@amaierhofer
Copy link
Contributor

@soulfly how did switching to redis affect message throughput? Do you have any numbers to share? We are currently pushing around 100k in 5 minutes with active record backend.

@soulfly
Copy link

soulfly commented Sep 27, 2016

@amaierhofer we do not have ready-to-share performance metrics
but Redis is in-memory store and all read/write operations work faster here with proper setup

loadhigh added a commit to loadhigh/rpush that referenced this issue Mar 15, 2018
1. Change the index on rpush_notifications to minimized number of locked records and pre-sort the records
2. Minimize the locking duration by moving the row dump code outside of the transaction

fixes rpush#200
aried3r pushed a commit that referenced this issue Apr 11, 2018
1. Change the index on rpush_notifications to minimized number of locked records and pre-sort the records
2. Minimize the locking duration by moving the row dump code outside of the transaction

fixes #200
masolin added a commit to cardinalblue/rpush that referenced this issue Jan 28, 2019
v3.1.1

Fixed

- Database deadlock [rpush#200](rpush#200) (by [@loadhigh](https://github.com/loadhigh) in [rpush#428](rpush#428))

Enhancements

- Change the index on `rpush_notifications` to minimize number of locked records and pre-sort the records ([rpush#428](rpush#428) by [@loadhigh](https://github.com/loadhigh))
- Minimize the locking duration by moving the row dump code outside of the transaction ([rpush#428](rpush#428) by [@loadhigh](https://github.com/loadhigh))
ormus2002 pushed a commit to ormus2002/rpush that referenced this issue Apr 28, 2020
1. Change the index on rpush_notifications to minimized number of locked records and pre-sort the records
2. Minimize the locking duration by moving the row dump code outside of the transaction

fixes rpush#200
ormus2002 pushed a commit to ormus2002/rpush that referenced this issue May 4, 2020
1. Change the index on rpush_notifications to minimized number of locked records and pre-sort the records
2. Minimize the locking duration by moving the row dump code outside of the transaction

fixes rpush#200
Adrian1707 pushed a commit to Adrian1707/rpush that referenced this issue Apr 24, 2024
1. Change the index on rpush_notifications to minimized number of locked records and pre-sort the records
2. Minimize the locking duration by moving the row dump code outside of the transaction

fixes rpush#200
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants