
Deadlock error while running test suite #121

Closed
anuragsoni opened this issue Dec 8, 2016 · 28 comments · Fixed by #147 or #154

Comments

@anuragsoni

I was following the setup guide from the README (using Docker and docker-compose). When running the full test suite with ./scripts/cargo test, I sometimes see random failures in the following tests.

test api::r0::profile::tests::put_avatar_url_unauthorized ... FAILED
test api::r0::room_creation::tests::with_unknown_invited_users ... FAILED
test api::r0::profile::tests::put_displayname_unauthorized ... FAILED

Turning on the debug logs, I can see the following:
DEBUG:ruma::error: Converting to ApiError from: DatabaseError(__Unknown, "deadlock detected")
DEBUG:ruma::error: Converting to ApiError from: UserReturnedError(ApiError { errcode: Unknown, error: "An unknown server-side error occurred." })
test api::r0::profile::tests::put_avatar_url_unauthorized ... FAILED
DEBUG:ruma::test: Username: {"username": "alice", "password": "secret"}
DEBUG:ruma::error: Converting to ApiError from: DatabaseError(__Unknown, "deadlock detected")
DEBUG:ruma::error: Converting to ApiError from: UserReturnedError(ApiError { errcode: Unknown, error: "An unknown server-side error occurred." })
test api::r0::room_creation::tests::with_invited_users ... FAILED
DEBUG:ruma::test: Username: {"username": "bob", "password": "secret"}
DEBUG:ruma::error: Converting to ApiError from: DatabaseError(__Unknown, "deadlock detected")
DEBUG:ruma::error: Converting to ApiError from: UserReturnedError(ApiError { errcode: Unknown, error: "An unknown server-side error occurred." })
test api::r0::profile::tests::put_displayname_unauthorized ... FAILED

Note: Running the tests individually succeeds.

Steps to reproduce:

  • ./scripts/cargo test
@farodin91
Member

Could you give us more information, for example the output of docker version?

@mujx

mujx commented Dec 8, 2016

I can't reproduce with the latest master. What do you mean by "running the tests individually"?

@farodin91
Member

I can reproduce this issue.

---- api::r0::room_creation::tests::with_unknown_invited_users stdout ----
	thread 'api::r0::room_creation::tests::with_unknown_invited_users' panicked at 'called `Option::unwrap()` on a `None` value', ../src/libcore/option.rs:323
stack backtrace:
   1:     0x55e6d4a05a4a - std::sys::imp::backtrace::tracing::imp::write::hbea47d9dd19b523c
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:42
   2:     0x55e6d4a08d1f - std::panicking::default_hook::{{closure}}::h6875a2976258b020
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/libstd/panicking.rs:247
   3:     0x55e6d4a088bd - std::panicking::default_hook::h88ffbc5922643264
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/libstd/panicking.rs:257
   4:     0x55e6d4a091c7 - std::panicking::rust_panic_with_hook::hc790e47d4ecc86cd
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/libstd/panicking.rs:451
   5:     0x55e6d4a09054 - std::panicking::begin_panic::hc066339e2fdc17d1
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/libstd/panicking.rs:413
   6:     0x55e6d4a08f79 - std::panicking::begin_panic_fmt::h5912b2d2df332044
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/libstd/panicking.rs:397
   7:     0x55e6d4a08f07 - rust_begin_unwind
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/libstd/panicking.rs:373
   8:     0x55e6d4a4554d - core::panicking::panic_fmt::h561c5ee168a3d2cb
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/libcore/panicking.rs:69
   9:     0x55e6d4a45484 - core::panicking::panic::h194ce5d68a8f28a1
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/libcore/panicking.rs:49
  10:     0x55e6d424f646 - <core::option::Option<T>>::unwrap::hff2cc58add4fb7f8
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/libcore/macros.rs:21
  11:     0x55e6d460144e - ruma::test::Test::create_access_token_with_username::h33d49eb80b9f4fc7
                        at /source/src/test.rs:165
  12:     0x55e6d45ce85f - ruma::api::r0::room_creation::tests::with_unknown_invited_users::h28e92bd926b45ba2
                        at /source/src/api/r0/room_creation.rs:234
  13:     0x55e6d49d376e - <F as test::FnBox<T>>::call_box::h471b07b4de264cc7
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/libtest/lib.rs:1274
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/libtest/lib.rs:140
  14:     0x55e6d4a11cca - __rust_maybe_catch_panic
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/libpanic_unwind/lib.rs:97
  15:     0x55e6d49c845a - std::panicking::try::do_call::h0f919d4447ff8cf3
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/libstd/panicking.rs:332
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/libstd/panic.rs:351
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/libtest/lib.rs:1219
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/libstd/panic.rs:295
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/libstd/panicking.rs:356
  16:     0x55e6d4a11cca - __rust_maybe_catch_panic
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/libpanic_unwind/lib.rs:97
  17:     0x55e6d49cee42 - <F as alloc::boxed::FnBox<A>>::call_box::hfe62b0b14bb937e4
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/libstd/panicking.rs:332
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/libstd/panic.rs:351
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/libstd/thread/mod.rs:287
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/liballoc/boxed.rs:595
  18:     0x55e6d4a07ff4 - std::sys::imp::thread::Thread::new::thread_start::h8084b1107992ae5b
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/liballoc/boxed.rs:605
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/libstd/sys_common/thread.rs:21
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/libstd/sys/unix/thread.rs:84
  19:     0x7fc4c0c1b0a3 - start_thread
  20:     0x7fc4c073a62c - clone
  21:                0x0 - <unknown>

Sometimes 1 test fails, sometimes 2 or more.

@farodin91
Member

This looks like it has something to do with unknown users.

@anuragsoni
Author

@mujx by individual tests I meant: cargo test put_avatar_url_unauthorized

@mujx

mujx commented Dec 8, 2016

I tried two other Linux machines (Debian 8, Arch) and still can't reproduce the issue.

@farodin91 Turn debug logging on and see if there is any deadlock. What other tests fail?

@anuragsoni Can you run make test on the previous commits and see when the problem occurs?

@anuragsoni
Author

@mujx I will try that later today. I do have the debug logs from the test run that failed for me; I attached a small snippet above, but I can attach the full file later today when I get home if you think that will help. I was able to reproduce this both with the Docker image and with a PostgreSQL instance installed on my own machine (Ubuntu 16.10, postgresql-9.5).

@mujx

mujx commented Dec 8, 2016

I think I reproduced the issue. I increased the number of test threads by adding -e RUST_TEST_THREADS=6 to the cargo script. @anuragsoni, test with RUST_TEST_THREADS=1 and see if the problem persists.

The problem seems to be happening when registering a new user with this helper function.

The failures are random but they all have the same source.

@farodin91
Member

This never happened before; it must be one of the last 5 commits.

@anuragsoni
Author

@mujx using RUST_TEST_THREADS=1 works, and that makes sense, I guess, since every test is run sequentially.

@jimmycuadra
Member

We don't actually want to restrict tests to one thread, though. Each test case should happen in a transaction so they don't step on each other like this.

@anuragsoni
Author

That is true. There shouldn't be deadlocks while running tests.

@anuragsoni
Author

anuragsoni commented Jan 4, 2017

So I have been trying to figure out the reason for this deadlock in the tests, but I was unable to get better logs than the ones I attached to the issue above. I tried to replicate the same scenario in another project where I use PostgreSQL, and this is what I can tell so far:

Minimal example to reproduce:

Transaction 1 executes:

INSERT INTO USERS (id, password_hash, active) VALUES ('alice', 'foo', false);

This acquires a row-level lock on the row with user id 'alice'.

Transaction 2 executes:

INSERT INTO USERS (id, password_hash, active) VALUES ('bob', 'foo', false);
INSERT INTO USERS (id, password_hash, active) VALUES ('alice', 'foo', false);

The first statement successfully acquires a lock and executes. But the second statement finds that the row it is attempting to lock is already locked, so it waits for transaction 1 to finish.

Transaction 1 executes:

INSERT INTO USERS (id, password_hash, active) VALUES ('bob', 'foo', false);

Transaction 1 attempts to acquire a row-level lock for 'bob', but transaction 2 already holds a lock on that row, so it waits for transaction 2 to finish. This is the deadlock: both transactions are now waiting for the other to finish.

The actual error in PostgreSQL will look something like this:

ERROR:  deadlock detected
DETAIL:  Process 10746 waits for ShareLock on transaction 3983; blocked by process 10929.
Process 10929 waits for ShareLock on transaction 3982; blocked by process 10746.
HINT:  See server log for query details.
CONTEXT:  while inserting index tuple (0,4) in relation "users_pkey"
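
For anyone who wants to reproduce this outside the test suite, here is a minimal standalone sketch of the same interleaving. It assumes the rust-postgres crate, a placeholder connection string, and the example users table above; it is an illustration of the race, not Ruma's actual test code.

use std::sync::{Arc, Barrier};
use std::thread;

use postgres::{Client, NoTls};

fn main() {
    let barrier = Arc::new(Barrier::new(2));
    // Thread 1 inserts alice then bob; thread 2 inserts bob then alice.
    let handles: Vec<_> = [("alice", "bob"), ("bob", "alice")]
        .iter()
        .map(|&(first, second)| {
            let barrier = Arc::clone(&barrier);
            thread::spawn(move || {
                let mut client =
                    Client::connect("host=localhost user=postgres", NoTls).unwrap();
                let mut tx = client.transaction().unwrap();
                tx.execute(
                    "INSERT INTO users (id, password_hash, active) VALUES ($1, 'foo', false)",
                    &[&first],
                )
                .unwrap();
                // Wait until both transactions hold their first row lock.
                barrier.wait();
                // Postgres kills one of these two inserts with SQLSTATE 40P01
                // ("deadlock detected"); the other proceeds once the victim
                // rolls back.
                let second_insert = tx.execute(
                    "INSERT INTO users (id, password_hash, active) VALUES ($1, 'foo', false)",
                    &[&second],
                );
                println!("{} then {}: {:?}", first, second, second_insert);
                // `tx` is dropped without committing, so nothing persists.
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}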

@jimmycuadra
Member

Thanks very much for digging into this. I'm not super strong with databases—could you explain exactly how the changes in #147 fix the problem? My understanding was that using transactions would isolate the SQL statements in each test from each other, but clearly they still create locks that can affect each other. How does reordering the lines in those tests avoid this? I think I'm just missing the rule of thumb to follow on how to spot this and avoid it in the future. Does this careful ordering also prevent deadlocks from happening at runtime in production? It seems like what happened here with the tests could just as easily happen with concurrent requests on the server.

@anuragsoni
Author

I found the following documentation on deadlocks arising from row-level locks in Postgres: https://www.postgresql.org/docs/9.4/static/explicit-locking.html

Like you, I'm not super strong on the details of database systems, but this problem reminded me of an issue we had at my workplace a while back. In short, my understanding is that the goal is to avoid cycles in concurrent code. In my mind I translate a transaction to a thread that is trying to lock resources.

If the rows are locked in a consistent order, i.e. we always try to lock row A before row B, then we avoid cycles. The transactions may block each other, but they won't deadlock. The re-ordering in the tests ensures that the inserts into the users table are always done in descending order of the primary key (the user id in this case; ascending order would have worked too, it just needs to be deterministic).

In the example above, if transaction 2 had locked the rows in the same order as transaction 1:

INSERT INTO USERS (id, password_hash, active) VALUES ('alice', 'foo', false);
INSERT INTO USERS (id, password_hash, active) VALUES ('bob', 'foo', false);

it would see that row 'alice' is locked by transaction 1, wait for that transaction to finish, and only then acquire the lock for row 'bob'. As a result, transaction 1 can continue its run, lock row 'bob', do its work, and finish. This avoids the deadlock by preventing a cyclic dependency between the two transactions.
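
To make the rule concrete, here is a sketch in application code, assuming the rust-postgres crate and the example users table above (insert_users_in_order is a hypothetical helper, not existing Ruma code):

use postgres::{Error, Transaction};

// Sorting first means every transaction that creates several users acquires
// its row locks in the same (ascending) order, so two concurrent
// transactions can block each other but never form a lock cycle.
fn insert_users_in_order(tx: &mut Transaction<'_>, user_ids: &[&str]) -> Result<(), Error> {
    let mut ids = user_ids.to_vec();
    ids.sort(); // deterministic lock-acquisition order
    for id in ids {
        tx.execute(
            "INSERT INTO users (id, password_hash, active) VALUES ($1, 'foo', false)",
            &[&id],
        )?;
    }
    Ok(())
}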

And yes, I suppose this scenario can just as easily happen with actual requests on the server.

Another article that I found useful:
https://web.archive.org/web/20160504235538/http://blog.nordeus.com/dev-ops/postgresql-locking-revealed.htm

@anuragsoni
Author

Also, when a deadlock is detected, the database aborts one of the transactions, so another approach could be to retry any transaction that was aborted because of a deadlock.
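
A sketch of what that retry could look like, assuming the rust-postgres crate (with_deadlock_retry and the attempt limit are hypothetical names, not existing Ruma code):

use postgres::error::SqlState;
use postgres::{Client, Error, Transaction};

// Hypothetical helper: run `f` inside a transaction and retry from scratch
// whenever Postgres aborts the transaction with SQLSTATE 40P01
// ("deadlock detected").
fn with_deadlock_retry<T>(
    client: &mut Client,
    max_attempts: u32,
    mut f: impl FnMut(&mut Transaction<'_>) -> Result<T, Error>,
) -> Result<T, Error> {
    let mut attempt = 1;
    loop {
        let mut tx = client.transaction()?;
        let result = f(&mut tx).and_then(|value| tx.commit().map(|()| value));
        match result {
            Err(ref err)
                if attempt < max_attempts
                    && err.code() == Some(&SqlState::T_R_DEADLOCK_DETECTED) =>
            {
                // The victim transaction was already rolled back by the
                // server, so open a fresh one and try again.
                attempt += 1;
            }
            other => return other,
        }
    }
}

Retrying is safe here because the deadlock victim's transaction is rolled back in full, so re-running f starts from a clean slate.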

@jimmycuadra
Member

jimmycuadra commented Jan 4, 2017

Would it also work to create a dedicated ID column for the users table? Would that prevent two DB connections trying to create a user with the same name at the same time from locking the same row?

Edit: And would that even matter, since we would still need a uniqueness constraint on the Matrix user ID? Or is the table's primary key the only concern for this type of locking?

@mujx

mujx commented Jan 4, 2017

The database logs can be accessed through docker-compose logs or docker logs.

I agree with @anuragsoni that we should be able to retry some of the transactions. I tried Jimmy's solution with the extra id, but there are still failures when setting the unique constraint on user_ids. I will try again, this time using a composite primary key. The changes are here.

Would it make sense to change the test helper so that it registers random strings as user_ids and returns a struct with the necessary info?

let alice = test.register_new_user();

alice.token => "regular token"
alice.id => "@qWIoYYgNCjwTpX:ruma.test"

Edit: The composite PK seems to work. @aurabindo, @anuragsoni, please try to verify.

@anuragsoni
Author

@mujx @jimmycuadra The error when setting a unique constraint on user_ids seems reasonable, as that will also try to acquire a row-level lock. Composite keys should work when using the SERIAL type, as shown in the link shared by @mujx. The other tables like events, profiles, and room_aliases could potentially have the same problem too.

But on another note: from what I've read of the Matrix spec so far, when running in a production-like environment it wouldn't be normal to have a single transaction create/update/delete multiple resources. In our tests we were creating more than one user in the same transaction, but in actual usage I'm guessing there would be just one resource per create/update/delete action. If that's true, there wouldn't be deadlocks as such. And in a case where there is a deadlock nonetheless, we should be able to retry the transaction.

As for the tests, we can control the test environment to avoid problems like this. We could use random strings as usernames, as suggested, or create resources in a specific order, etc.

@jimmycuadra
Member

Sounds like randomized user IDs would avoid the deadlock problem, but it also might be confusing to have a user referred to as "alice" whose user localpart is not actually "alice". I dunno, maybe if the helper method didn't take a localpart parameter, like you show above, it would be clear enough reading the code. Technically it still wouldn't be deterministic, but the chances of two identical IDs being created in two concurrent tests are very slim.

@anuragsoni
Author

@jimmycuadra for the tests, just following a simple rule like "create names in alphabetical order" might be enough. That would ensure no deadlocks in tests. Adding random strings to the user_ids might be confusing (and more code to maintain 😁)

@mujx

mujx commented Jan 4, 2017

I'm in favor of the random strings because they wouldn't change the test logic or add extra mental overhead when writing new tests (with ordering, new contributors would have to be aware of the situation, and it would also need to be documented). The composite primary key solution seems unnecessary for the users table.

We shouldn't really care if the alice id doesn't contain "alice", because the logic is independent of the names and the wrapper structs will keep the code readable. We can reuse the code that generates access tokens to generate them.
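
As a sketch of what the generation could look like with today's rand crate (TestUser and random_localpart are illustrative names, not the existing helper code):

use rand::distributions::Alphanumeric;
use rand::{thread_rng, Rng};

// Illustrative shape for the proposed helper's return value: callers never
// pick a localpart, so concurrent tests can't collide on the same row.
struct TestUser {
    id: String,    // e.g. "@qwioyygncjwtpx:ruma.test"
    token: String, // access token for authenticated requests
}

fn random_localpart() -> String {
    thread_rng()
        .sample_iter(&Alphanumeric)
        .take(12)
        .map(char::from)
        .collect::<String>()
        .to_lowercase()
}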

@jimmycuadra
Member

I reread this comment and I think I finally understand how changing the order of operations prevents the deadlock. I thought I'd write it out here just to help myself and anyone who comes across this in the future and wants to learn about this problem.

My confusion was about the point at which locks are released. Anurag says (as does the Postgres documentation) that this sequence of events:

  1. Transaction 1 locks row A.
  2. Transaction 2 locks row B.
  3. Transaction 1 attempts to lock row B.
  4. Transaction 2 attempts to lock row A.

will result in a deadlock because T1 will be waiting for row B to unlock, which will never happen because T2 is waiting for row A to unlock, which is held by T1. This implies that once a row is locked by a transaction, it stays locked until the transaction is rolled back or committed.

Now let's look at two transactions that use the same order:

  1. Transaction 1 locks row A.
  2. Transaction 2 attempts to lock row A.
  3. Transaction 1 locks row B.
  4. Transaction 2 attempts to lock row B.

Assuming the first two locks happen concurrently and T1 happens to win the lock, T2 now cannot proceed until that lock is released. However, T1 can continue on its merry way because row B is not locked—T2 is still stopped and waiting at its previous step. T1 can continue to transaction commit or roll back, at which point the locks on rows A and B will be released, and T2 can now proceed unencumbered.

@jimmycuadra
Member

Sounds like we all agree on randomized user IDs from the helper method. I believe this will require updating some tests where user IDs are hardcoded into JSON that we are asserting against. I'm inclined to merge #147 for the quick fix, but leave this open until we make the change to the helper method in another PR.

@anuragsoni
Author

anuragsoni commented Jan 4, 2017

@jimmycuadra @mujx randomized user IDs sound good then.

@jimmycuadra #147 can indeed be a stopgap to unblock any other PRs that might be hitting the deadlock problem.

@mujx are you going to work on the test helpers? If not, I'm okay to work on that as well.

@mujx

mujx commented Jan 4, 2017

@anuragsoni Feel free to start working on it.

@anuragsoni
Author

@mujx Okay, I will pick it up. Also, I was thinking about the 'composite key' approach you mentioned above. That might avoid deadlocks for insert actions, but wouldn't it still lead to deadlocks for update actions?

@jimmycuadra @mujx Maybe we need another issue to track the change required to retry a transaction when a deadlock is encountered in the normal production flow? I don't believe we handle retries currently.

@jimmycuadra
Member

@anuragsoni I opened #148 to discuss retries.
