MemoryLocker deadlock #1603

Closed
1 task done
BeiKeJieDeLiuLangMao opened this issue Sep 6, 2019 · 6 comments · Fixed by #1605

Comments

@BeiKeJieDeLiuLangMao
Contributor

BeiKeJieDeLiuLangMao commented Sep 6, 2019

  • I have searched the issues of this repository and believe that this is not a duplicate.

Ⅰ. Issue Description

A deadlock happened in io.seata.server.lock.memory.MemoryLocker when two branch transactions of the same TM execute concurrently: branch transaction 1 (BT1) wants to acquire locks with lockKeys table1:1,2,3,4,5 and BT2 wants to acquire locks with lockKeys table1:5,4,3,2,1.

Ⅱ. Describe what happened

My analysis:
In io.seata.server.lock.memory.MemoryLocker#acquireLock, the thread holds the bucketLockMap's monitor lock while acquiring each PK lock. If a PK is already locked by another transaction, a lock conflict has occurred and the branch transaction releases all the locks it has acquired so far. The problem is that this release happens while the current bucket's monitor lock is still held. If two branch transactions acquire locks in different orders, a deadlock can occur.

Going back to what I described before: BT1 wants to acquire PK locks 1,2,3,4,5 and BT2 wants to acquire PK locks 5,4,3,2,1, and they execute concurrently:

  1. BT1 has already acquired PK locks 1 and 2
  2. BT2 has already acquired PK locks 5 and 4
  3. BT1 and BT2 both try to acquire PK lock 3, and BT1 wins
  4. BT1 tries to get PK lock 4 but finds it locked by BT2, so BT1 tries to release PK locks 1, 2 and 3; at this point BT1 holds PK lock 4's bucket monitor lock
  5. BT2 tries to get PK lock 3 but finds it locked by BT1, so BT2 tries to release PK locks 5 and 4; at this point BT2 holds PK lock 3's bucket monitor lock
  6. BT1 holds 4's bucket lock and wants to release 3 (which needs 3's bucket monitor lock), while BT2 holds 3's bucket lock and wants to release 4: deadlock (see the sketch below)
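
To make the lock-ordering problem concrete, here is a minimal, self-contained sketch (toy maps stand in for the per-bucket maps; this is not the actual MemoryLocker code) that reproduces the same "hold one bucket's monitor while releasing under another" pattern and, given the deliberate delay, hangs when run:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy reproduction of the pattern described above: a conflicting thread tries to
// roll back an earlier PK lock (which needs another bucket's monitor) while it is
// still inside the monitor of the bucket where the conflict was detected.
public class BucketDeadlockSketch {
    static final Map<Integer, Long> BUCKET_OF_PK3 = new ConcurrentHashMap<>();
    static final Map<Integer, Long> BUCKET_OF_PK4 = new ConcurrentHashMap<>();

    static boolean acquire(Map<Integer, Long> bucket, int pk, long txId,
                           Map<Integer, Long> heldBucket, int heldPk) {
        synchronized (bucket) {                   // monitor of the bucket being acquired
            Long owner = bucket.get(pk);
            if (owner != null && owner != txId) { // conflict: release earlier locks...
                sleepQuietly(100);                // widen the race window for the demo
                synchronized (heldBucket) {       // ...but this needs the OTHER bucket's monitor
                    heldBucket.remove(heldPk, txId);
                }
                return false;
            }
            bucket.put(pk, txId);
            return true;
        }
    }

    static void sleepQuietly(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException ignored) { }
    }

    public static void main(String[] args) {
        BUCKET_OF_PK3.put(3, 1001L);              // BT1 already holds PK 3
        BUCKET_OF_PK4.put(4, 1002L);              // BT2 already holds PK 4
        // BT1 now asks for PK 4 and BT2 asks for PK 3; each thread blocks inside the
        // other's bucket monitor while trying to release its own earlier lock.
        new Thread(() -> acquire(BUCKET_OF_PK4, 4, 1001L, BUCKET_OF_PK3, 3)).start();
        new Thread(() -> acquire(BUCKET_OF_PK3, 3, 1002L, BUCKET_OF_PK4, 4)).start();
    }
}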

Ⅲ. Describe what you expected to happen

MemoryLocker should work without deadlocks.

Ⅳ. How to reproduce it (as minimally and precisely as possible)

  1. Edit io.seata.server.lock.LockManagerTest
  2. Copy the deadlockBranchSessionsProvider and deadlockTest functions below into LockManagerTest
  3. Run deadlockTest
  4. Deadlock happens
@ParameterizedTest
@MethodSource("deadlockBranchSessionsProvider")
public void deadlockTest(BranchSession branchSession1, BranchSession branchSession2) throws Exception {
    LockManager lockManager = new MemoryLockManagerForTest();
    final AtomicBoolean first = new AtomicBoolean();
    final AtomicBoolean second = new AtomicBoolean();
    CountDownLatch countDownLatch = new CountDownLatch(2);
    new Thread(() -> {
        try {
            first.set(lockManager.acquireLock(branchSession1));
        } catch (TransactionException e) {
            e.printStackTrace();
        } finally {
            countDownLatch.countDown();
        }
    }).start();
    new Thread(() -> {
        try {
            second.set(lockManager.acquireLock(branchSession2));
        } catch (TransactionException e) {
            e.printStackTrace();
        } finally {
            countDownLatch.countDown();
        }
    }).start();
    countDownLatch.await();
    System.out.println("Result: first is " + first.get() + " ,second is " + second.get());
    Assertions.assertTrue(first.get() || second.get());
}

static Stream<Arguments> deadlockBranchSessionsProvider() {
    BranchSession branchSession1 = new BranchSession();
    branchSession1.setTransactionId(UUIDGenerator.generateUUID());
    branchSession1.setBranchId(1L);
    branchSession1.setClientId("c1");
    branchSession1.setResourceGroupId("my_test_tx_group");
    branchSession1.setResourceId("tb_1");
    branchSession1.setLockKey("t:1,2,3,4,5");
    branchSession1.setBranchType(BranchType.AT);
    branchSession1.setApplicationData("{\"data\":\"test\"}");

    BranchSession branchSession2 = new BranchSession();
    branchSession2.setTransactionId(UUIDGenerator.generateUUID());
    branchSession2.setBranchId(2L);
    branchSession2.setClientId("c1");
    branchSession2.setResourceGroupId("my_test_tx_group");
    branchSession2.setResourceId("tb_1");
    branchSession2.setLockKey("t:5,4,3,2,1");
    branchSession2.setBranchType(BranchType.AT);
    branchSession2.setApplicationData("{\"data\":\"test\"}");
    return Stream.of(
            Arguments.of(branchSession1, branchSession2));
}

I added some logs in MemoryLocker; they confirm my analysis:

2019-09-06 14:19:58.553 INFO [Thread-2]io.seata.core.lock.AbstractLocker.acquireLock:96 -Acquire lock process wanna use bucket(1488605238)'s monitor lock, content is: {}, transaction: 1002
2019-09-06 14:19:58.553 INFO [Thread-1]io.seata.core.lock.AbstractLocker.acquireLock:96 -Acquire lock process wanna use bucket(839131056)'s monitor lock, content is: {}, transaction: 1001
2019-09-06 14:19:58.594 INFO [Thread-2]io.seata.core.lock.AbstractLocker.acquireLock:134 -Acquire lock process already free bucket(1488605238)'s monitor lock, content is: {5=1002}, transaction: 1002
2019-09-06 14:19:58.594 INFO [Thread-1]io.seata.core.lock.AbstractLocker.acquireLock:134 -Acquire lock process already free bucket(839131056)'s monitor lock, content is: {1=1001}, transaction: 1001
2019-09-06 14:19:58.594 INFO [Thread-2]io.seata.core.lock.AbstractLocker.acquireLock:96 -Acquire lock process wanna use bucket(1051681639)'s monitor lock, content is: {}, transaction: 1002
2019-09-06 14:19:58.594 INFO [Thread-1]io.seata.core.lock.AbstractLocker.acquireLock:96 -Acquire lock process wanna use bucket(2030237312)'s monitor lock, content is: {}, transaction: 1001
2019-09-06 14:19:58.595 INFO [Thread-1]io.seata.core.lock.AbstractLocker.acquireLock:134 -Acquire lock process already free bucket(2030237312)'s monitor lock, content is: {2=1001}, transaction: 1001
2019-09-06 14:19:58.595 INFO [Thread-2]io.seata.core.lock.AbstractLocker.acquireLock:134 -Acquire lock process already free bucket(1051681639)'s monitor lock, content is: {4=1002}, transaction: 1002
2019-09-06 14:19:58.595 INFO [Thread-1]io.seata.core.lock.AbstractLocker.acquireLock:96 -Acquire lock process wanna use bucket(1448018216)'s monitor lock, content is: {}, transaction: 1001
2019-09-06 14:19:58.595 INFO [Thread-2]io.seata.core.lock.AbstractLocker.acquireLock:96 -Acquire lock process wanna use bucket(1448018216)'s monitor lock, content is: {}, transaction: 1002
2019-09-06 14:19:58.595 INFO [Thread-1]io.seata.core.lock.AbstractLocker.acquireLock:134 -Acquire lock process already free bucket(1448018216)'s monitor lock, content is: {3=1001}, transaction: 1001
2019-09-06 14:19:58.595 INFO [Thread-2]io.seata.core.lock.AbstractLocker.acquireLock:115 -Global lock on [t:3] is holding by 1001
2019-09-06 14:19:58.595 INFO [Thread-1]io.seata.core.lock.AbstractLocker.acquireLock:96 -Acquire lock process wanna use bucket(1051681639)'s monitor lock, content is: {4=1002}, transaction: 1001
2019-09-06 14:19:58.596 INFO [Thread-1]io.seata.core.lock.AbstractLocker.acquireLock:115 -Global lock on [t:4] is holding by 1002
2019-09-06 14:19:58.610 INFO [Thread-2]io.seata.common.loader.EnhancedServiceLoader.loadFile:237 -load Locker[file] extension by class[io.seata.server.lock.memory.MemoryLocker]
2019-09-06 14:19:58.611 INFO [Thread-2]io.seata.core.lock.AbstractLocker.releaseLock:150 -Release lock process wanna use bucket(1051681639)'s monitor lock, content is: {4=1002}, transaction: 1002
2019-09-06 14:19:58.611 INFO [Thread-1]io.seata.core.lock.AbstractLocker.releaseLock:150 -Release lock process wanna use bucket(839131056)'s monitor lock, content is: {1=1001}, transaction: 1001
2019-09-06 14:19:58.611 INFO [Thread-1]io.seata.core.lock.AbstractLocker.releaseLock:162 -Release lock process already free bucket(839131056)'s monitor lock, content is: {}, transaction: 1001
2019-09-06 14:19:58.611 INFO [Thread-1]io.seata.core.lock.AbstractLocker.releaseLock:150 -Release lock process wanna use bucket(1448018216)'s monitor lock, content is: {3=1001}, transaction: 1001

Ⅴ. Anything else we need to know?

This bug can be resolved by moving the lock-release process out of the bucket monitor lock's scope:

// in io.seata.server.lock.memory.MemoryLocker#acquireLock
// use this flag to decide whether to release all acquired locks
boolean lockConflict = false;
synchronized (bucketLockMap) {
    Long lockingTransactionId = bucketLockMap.get(pk);
    if (lockingTransactionId == null) {
        // No existing lock
        bucketLockMap.put(pk, transactionId);
        Set<String> keysInHolder = bucketHolder.get(bucketLockMap);
        if (keysInHolder == null) {
            bucketHolder.putIfAbsent(bucketLockMap, new ConcurrentSet<String>());
            keysInHolder = bucketHolder.get(bucketLockMap);
        }
        keysInHolder.add(pk);
    } else if (lockingTransactionId.longValue() == transactionId) {
        // Locked by me
        continue;
    } else {
        lockConflict = true;
        LOGGER.info("Global lock on [" + tableName + ":" + pk + "] is holding by " + lockingTransactionId);
    }
}
// Release all acquired locks without holding the bucket monitor lock
if (lockConflict) {
    try {
        // Release all acquired locks.
        branchSession.unlock();
    } catch (TransactionException e) {
        throw new FrameworkException(e);
    }
    return false;
}

There is another way to fix the deadlock: sort the PK list in alphabetical order before acquiring the locks. This not only resolves the problem, it also brings another advantage, which I will discuss below. Going back to the lock-release process, I still think it is necessary to move it out of the bucket lock scope, because that is not where it belongs: doing so reduces the risk of deadlock and shortens the time the bucket monitor lock is held.

Now let's talk about the advantage of sorting the PK lock list. If two branch transactions execute concurrently with different PK list orders, both BTs fail; the RM will of course retry acquiring the PK locks, but it may fail again, and in extreme cases this scenario can repeat indefinitely. Once the two branch transactions acquire locks in the same PK order, one of them can always proceed normally.

Acquiring locks in the same order:

2019-09-06 14:52:23.472 INFO [Thread-2]io.seata.core.lock.AbstractLocker.acquireLock:115 -Global lock on [t:1] is holding by 1001
2019-09-06 14:52:23.492 INFO [Thread-2]io.seata.common.loader.EnhancedServiceLoader.loadFile:237 -load Locker[file] extension by class[io.seata.server.lock.memory.MemoryLocker]
Result: first is true ,second is false

Acquiring locks in different orders:

2019-09-06 15:10:16.804 INFO [Thread-2]io.seata.core.lock.AbstractLocker.acquireLock:115 -Global lock on [t:3] is holding by 1001
2019-09-06 15:10:16.804 INFO [Thread-1]io.seata.core.lock.AbstractLocker.acquireLock:115 -Global lock on [t:4] is holding by 1002
2019-09-06 15:10:16.822 INFO [Thread-2]io.seata.common.loader.EnhancedServiceLoader.loadFile:237 -load Locker[file] extension by class[io.seata.server.lock.memory.MemoryLocker]
Result: first is false ,second is false

In summary, I think sorting the PK list before acquiring locks is very helpful. If you care a lot about TC performance, the sorting could be done in the RM instead, as long as the PK list order is preserved until the locks are acquired. A minimal sketch of the idea follows.
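
A minimal sketch of the sorting idea, assuming a hypothetical helper that normalizes one lock key before the acquire loop (the class and method names are illustrative only, not the actual Seata API):

import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical helper: sort the PK part of a lock key so that every branch
// transaction walks the buckets in the same global order.
public class LockKeySorter {
    public static String normalize(String lockKey) {
        String[] parts = lockKey.split(":", 2);           // "t:5,4,3,2,1" -> ["t", "5,4,3,2,1"]
        String sortedPks = Arrays.stream(parts[1].split(","))
                .sorted()                                  // alphabetical order, as proposed above
                .collect(Collectors.joining(","));
        return parts[0] + ":" + sortedPks;
    }

    public static void main(String[] args) {
        System.out.println(normalize("t:5,4,3,2,1"));      // prints t:1,2,3,4,5
        System.out.println(normalize("t:1,2,3,4,5"));      // prints t:1,2,3,4,5
    }
}

With both branch transactions normalized this way, the failing interleaving above collapses into the "same order" case, where one of the two transactions always succeeds.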

Ⅵ. Environment:

  • JDK version : 1.8
  • OS : MacOS
  • Others:
@xingfudeshi
Member

Good job!

@BeiKeJieDeLiuLangMao
Contributor Author

@xingfudeshi
Could you please help me merge my PR #1605?

@xingfudeshi
Member

xingfudeshi commented Sep 10, 2019

Hey, thanks for your work. I'd like to merge this PR, but we need at least two approvals from committers.

@BeiKeJieDeLiuLangMao
Contributor Author

Thank you for your review request.☺️

@slievrly
Member

@BeiKeJieDeLiuLangMao It must obtain a database lock before getting a global lock. So BT1 needs to get all the DB locks on table1 (5 row locks) and then try to get the global lock. Meanwhile, BT2 can't get any of those 5 DB locks, so it can't reach the global-lock acquisition in step 2.

@BeiKeJieDeLiuLangMao
Contributor Author

@slievrly
In the normal case, the scenario is as you described. Do you mean we shouldn't fix this issue?

If BT1 sends the branchRegister request successfully and then crashes, its DB locks are released. After that, BT2 sends a branchRegister request with the same locks in a different order. The deadlock can still happen.
