Transactionally deadlock still there on 3.2.0-M1? #1614

Closed
curreli opened this issue Sep 21, 2016 · 46 comments

@curreli

curreli commented Sep 21, 2016

I've been facing the #1274 issue on Slick 3.1.1, so I upgraded to 3.2.0-M1, which supposedly contains a fix. However, after the upgrade, I'm still getting deadlocks when using transactionally.

I can reproduce this issue by running more than 20 DBIOActions in parallel with transactionally. If I remove transactionally, everything is fine. After investigating, I figured out that 20 is the default number of threads in the AsyncExecutor, so that kind of makes sense.

def default(name: String = "AsyncExecutor.default"): AsyncExecutor = apply(name, 20, 1000)

I'm using HikariCP v2.5.1 along with PostgreSQL Driver v9.4.1208.jre7. PostgreSQL is configured to accept a maximum of 100 connections, and the HikariCP data source is configured with a max pool size of 100.
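
For reference, here is a minimal sketch of the kind of reproduction being described, assuming a hypothetical items table and "mydb" config key; firing more transactional actions in parallel than the executor has threads (20 by default) reproduces the hang, while dropping .transactionally makes it go away:

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import slick.jdbc.PostgresProfile.api._

object DeadlockRepro extends App {
  // Hypothetical table; only the shape of the repro matters.
  class Items(tag: Tag) extends Table[(Long, String)](tag, "items") {
    def id   = column[Long]("id", O.PrimaryKey)
    def name = column[String]("name")
    def *    = (id, name)
  }
  val items = TableQuery[Items]
  val db    = Database.forConfig("mydb") // hypothetical config key

  // Fire more transactional actions in parallel than AsyncExecutor threads (default 20).
  val runs = (1 to 30).map { i =>
    val action = for {
      _ <- items.insertOrUpdate((i.toLong, s"item-$i"))
      _ <- items.filter(_.id === i.toLong).result
    } yield ()
    db.run(action.transactionally) // hangs; without .transactionally it completes
  }
  Await.result(Future.sequence(runs), 1.minute)
}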

A thread dump shows all AsyncExecutor threads as RUNNABLE:

"AsyncExecutor.default-20" #32 daemon prio=5 os_prio=0 tid=0x0000000024665000 nid=0x2648 runnable [0x0000000026a3e000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:170)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at org.postgresql.core.VisibleBufferedInputStream.readMore(VisibleBufferedInputStream.java:143)
        at org.postgresql.core.VisibleBufferedInputStream.ensureBytes(VisibleBufferedInputStream.java:112)
        at org.postgresql.core.VisibleBufferedInputStream.read(VisibleBufferedInputStream.java:70)
        at org.postgresql.core.PGStream.ReceiveChar(PGStream.java:283)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1799)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:200)
        - locked <0x00000006c17b10b0> (a org.postgresql.core.v3.QueryExecutorImpl)
        at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:424)
        at org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:161)
        at org.postgresql.jdbc.PgPreparedStatement.executeUpdate(PgPreparedStatement.java:133)
        at com.zaxxer.hikari.pool.ProxyPreparedStatement.executeUpdate(ProxyPreparedStatement.java:61)
        at com.zaxxer.hikari.pool.HikariProxyPreparedStatement.executeUpdate(HikariProxyPreparedStatement.java)
        at slick.jdbc.JdbcActionComponent$InsertActionComposerImpl$InsertOrUpdateAction$$anonfun$nativeUpsert$1.apply(JdbcActionComponent.scala:559)
        at slick.jdbc.JdbcActionComponent$InsertActionComposerImpl$InsertOrUpdateAction$$anonfun$nativeUpsert$1.apply(JdbcActionComponent.scala:556)
        at slick.jdbc.JdbcBackend$SessionDef$class.withPreparedStatement(JdbcBackend.scala:352)
        at slick.jdbc.JdbcBackend$BaseSession.withPreparedStatement(JdbcBackend.scala:414)
        at slick.jdbc.JdbcActionComponent$InsertActionComposerImpl.preparedInsert(JdbcActionComponent.scala:497)
        at slick.jdbc.JdbcActionComponent$InsertActionComposerImpl$InsertOrUpdateAction.nativeUpsert(JdbcActionComponent.scala:556)
        at slick.jdbc.JdbcActionComponent$InsertActionComposerImpl$InsertOrUpdateAction.f$1(JdbcActionComponent.scala:539)
        at slick.jdbc.JdbcActionComponent$InsertActionComposerImpl$InsertOrUpdateAction.run(JdbcActionComponent.scala:544)
        at slick.jdbc.JdbcActionComponent$SimpleJdbcProfileAction.run(JdbcActionComponent.scala:31)
        at slick.jdbc.JdbcActionComponent$SimpleJdbcProfileAction.run(JdbcActionComponent.scala:28)
        at slick.dbio.DBIOAction$$anon$4$$anonfun$run$3.apply(DBIOAction.scala:234)
        at slick.dbio.DBIOAction$$anon$4$$anonfun$run$3.apply(DBIOAction.scala:234)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
        at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
        at slick.dbio.DBIOAction$$anon$4.run(DBIOAction.scala:234)
        at slick.dbio.DBIOAction$$anon$4.run(DBIOAction.scala:232)
        at slick.basic.BasicBackend$DatabaseDef$$anon$2.liftedTree1$1(BasicBackend.scala:240)
        at slick.basic.BasicBackend$DatabaseDef$$anon$2.run(BasicBackend.scala:240)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Before we figure this one out, is there a way I can increase the default number of threads in the AsyncExecutor?

@curreli changed the title from "Transactionally deadlock still there on 3.2.0-M1" to "Transactionally deadlock still there on 3.2.0-M1?" on Sep 21, 2016
@szeiger
Member

szeiger commented Oct 19, 2016

There is no way to configure the default AsyncExecutor. You'll have to create a separate one.
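
For reference, a minimal sketch of creating a separate executor and handing it to the database; the thread count and JDBC URL below are placeholders, and the Some(maxConnections) argument is the forDataSource overload used further down in this thread:

import com.zaxxer.hikari.{HikariConfig, HikariDataSource}
import slick.jdbc.PostgresProfile.api._
import slick.util.AsyncExecutor

object CustomExecutorExample extends App {
  // HikariCP pool configured as in the report: max pool size 100 (JDBC URL is a placeholder).
  val hikariConfig = new HikariConfig()
  hikariConfig.setJdbcUrl("jdbc:postgresql://localhost/mydb")
  hikariConfig.setMaximumPoolSize(100)
  val dataSource = new HikariDataSource(hikariConfig)

  // Same three-argument factory as AsyncExecutor.default, just with a larger (placeholder) thread count.
  val customExecutor = AsyncExecutor("AsyncExecutor.custom", numThreads = 50, queueSize = 1000)

  val db = Database.forDataSource(dataSource, Some(100), executor = customExecutor)
}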

@szeiger
Member

szeiger commented Oct 31, 2016

This really needs a reproduction. According to your description it corresponds to https://github.com/slick/slick/blob/master/slick-testkit/src/test/scala/slick/test/jdbc/hikaricp/SlickDeadlockTest.scala#L48-L58, which was successfully fixed by #1274.

@szeiger
Member

szeiger commented Nov 18, 2016

I've seen this in a local test run:

[info] Test com.typesafe.slick.testkit.tests.ActionTest.testStreaming[h2rownum] started
Exception in thread "AsyncExecutor.default-2" java.lang.IllegalArgumentException: requirement failed: count cannot be decreased
         at scala.Predef$.require(Predef.scala:224)
         at slick.util.ManagedArrayBlockingQueue$$anonfun$decreaseInUseCount$1.apply$mcV$sp(ManagedArrayBlockingQueue.scala:53)
         at slick.util.ManagedArrayBlockingQueue$$anonfun$decreaseInUseCount$1.apply(ManagedArrayBlockingQueue.scala:52)
         at slick.util.ManagedArrayBlockingQueue$$anonfun$decreaseInUseCount$1.apply(ManagedArrayBlockingQueue.scala:52)
         at slick.util.ManagedArrayBlockingQueue.slick$util$ManagedArrayBlockingQueue$$locked(ManagedArrayBlockingQueue.scala:217)
         at slick.util.ManagedArrayBlockingQueue.decreaseInUseCount(ManagedArrayBlockingQueue.scala:52)
         at slick.util.AsyncExecutor$$anon$2$$anon$1.afterExecute(AsyncExecutor.scala:81)
         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1150)
         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
         at java.lang.Thread.run(Thread.java:745)

The test succeeded and it didn't cause any negative effects. This could just be an illegal state during a shutdown or something similar, but it could also point to a bug in #1274. If the in-use count is lower than it should be at some point during normal operation, that would defeat the deadlock prevention.

@szeiger
Member

szeiger commented Nov 18, 2016

To make my previous comment sound less alarmist: the fact that the test succeeded, and that I saw this message in the console at all (which only happens for log messages between tests, not for the actual test output), points to an illegal shutdown state.

@hvesalai
Member

I've just upgraded to 3.2.0-M2 and am seeing deadlocks. I have a Hikari pool with 3 connections, a queue size of 1000, and 6 threads.

The form in which the deadlocks appear is

Future.traverse(bigCollection) { item =>
  val f1: Future = calculateSomething(item) // does not use the db, so should be unrelated

  f1.flatMap { x =>
    db.run {
      val a1: DBIOAction = compiledQueryFrom(x).result

      a1.flatMap { r =>
        val a2: DBIOAction = DBIO.sequence(manyUpdateActionsFrom(r))
        val a3: DBIOAction = insertAction

        (a1 andThen a2).transactionally
      }
    }
  }
}

So in essence a large number of DBIOActions are created at once, but not so many as to overflow the DB queue.

@szeiger
Member

szeiger commented Jan 24, 2017

@hvesalai Are the pools configured by Slick or manually? The new scheduler needs to know the correct connection pool size. (I'm not sure we ever considered the artificial case where the number of connections is lower than the number of threads, so it's possible that it wouldn't work properly in this case.)

The gist of the code above, as far as DBIO scheduling is concerned, seems to be simply synchronous action -> wait for Future -> other synchronous action (a1.flatMap { _ => a2 }). Everything else should be fused.
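
For a manually created executor, a sketch of passing the pool size explicitly, using the sizes mentioned above (3 connections, 6 threads, queue of 1000); the named parameters are the ones from the AsyncExecutor.apply overload quoted later in this thread:

import slick.util.AsyncExecutor

// The executor is told the real Hikari pool size via maxConnections so the scheduler
// can account for it; the sizes here just mirror the setup described above.
val executor = AsyncExecutor(
  name           = "AsyncExecutor.custom",
  minThreads     = 6,
  maxThreads     = 6,
  queueSize      = 1000,
  maxConnections = 3
)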

@hvesalai
Member

It is a manually configured AsyncExecutor for which I hadn't set maxConnections. I have now changed it to set maxConnections, but I have also made other large changes in the code to reduce the number of simultaneous DBIOActions.

I agree that everything should have been fused, so the deadlock really baffles me.

Please also see my comment in #1461 (comment)

@szeiger
Member

szeiger commented Jan 24, 2017

So does it work now with the correctly configured size?

The flatMap cannot be fused. Otherwise you shouldn't run into the deadlock situation.

@hvesalai
Member

I have not seen any deadlocks since I configured it with maxConnections, but I also changed the code so that it does not fire simultaneous DBIOActions.

@szeiger szeiger added this to the 3.3.0 milestone Jan 24, 2017
@szeiger
Member

szeiger commented Jan 24, 2017

From the reports so far there is no indication that the current scheduling algorithm fails if the pool is configured correctly.

I think it would be good to reimplement scheduling in 3.3 as described by @hvesalai in #1461 (comment) and myself in the subsequent comment. This should be more robust and does not require knowledge of the pool size.

@dspasojevic

We have seen this issue when submitting DBIOs composed of large numbers of sequenced DBIOs that affect the same set of rows in the database.

We see the three worker threads attempting to read results (the same stack trace as @curreli), but another transaction (that was initiated first) is a Postgres lock point for them (i.e. the results that the three worker threads are attempting to read will not be available until the fourth transaction clears).

In the thread dump we see no thread that is attempting to read the (available) results from the transaction that is the lock point.

Postgres reports the lock-point transaction as "idle in transaction", presumably because whatever should be reading the result has gone away.

Running the same DBIO in isolation works, and increasing numThreads means we can increase the number of concurrent DBIOs that can be run, but we can always hit a point where the lock-point transaction comes into existence and causes the other transaction to be deadlocked.

We are running 3.2.0. numThreads is 3 and maxConnections is 15.

@cvogt cvogt added the bug label Mar 7, 2017
@szeiger
Member

szeiger commented Mar 7, 2017

@dspasojevic Thanks, this is really useful. Can you reproduce it when numThreads and maxConnections are set to the same value?

I hadn't really thought about this case before. This could be a deadlock in the database instead of one in Slick's scheduler. It's different from #1274 but nevertheless caused by Slick's scheduler. If my assumption is correct, it should work when maxConnections is not larger than numThreads.

Since we do our own scheduling you can run into a situation where all DBIO threads are blocked on reading from the database but the row is locked and the transaction that holds the lock is currently suspended and can't run because no DBIO thread is available. The database can't detect this as a deadlock because it assumes that all connections run independently.

I can't think of any solution that preserves our more resource-efficient scheduling with suspended connections in this case. Pinned sessions are OK as long as they are in auto-commit mode, but when a session is in transaction mode we can't give up the DBIO thread and schedule a continuation while waiting for an asynchronous result. Instead we need to block the DBIO thread, to prevent another DBIO action from running that could block until the transaction is committed.

@dspasojevic

dspasojevic commented Mar 8, 2017

@szeiger I can't reproduce after setting numThreads and maxConnections to be equal.

What effect does this have on long-running streaming DBIOs? I thought that the additional connections beyond numThreads were used as housekeeping threads to make sure streaming queries are kept alive?

Worth noting that this issue is exacerbated by Hikari taking the position that it is the job of application code to detect and shut down "idle in transaction" operations. Since neither Slick nor Hikari does this, we can only detect and recover from it at the database level itself.
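
One database-side safety net, assuming a PostgreSQL server of 9.6 or newer (which introduced the setting; the 60-second value below is just a placeholder), is an idle-in-transaction timeout, set here per session with a plain SQL action:

import slick.jdbc.PostgresProfile.api._

// Terminate any session that stays "idle in transaction" for more than 60 s.
// Requires PostgreSQL 9.6+; it can also be set globally via ALTER DATABASE ... SET.
val setIdleTimeout: DBIO[Int] = sqlu"SET idle_in_transaction_session_timeout = 60000"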

@szeiger
Member

szeiger commented Mar 8, 2017

Streaming doesn't usually run in a transaction so it would still free up the thread. Or is it possible that rows get locked merely by the presence of an open result set? If that's the case then we'd have to always block the thread (at least by default).

Currently all interruptions of database I/O (which are always asynchronous) give up the thread and schedule a continuation. This includes back-pressure handling during streaming as well as non-DBIO computations (for example, whenever you call a DBIO method like map or flatMap that takes an implicit ExecutionContext). If the session is pinned (usually because of a transaction or a streaming result set) the connection is kept alive as part of the continuation.

Setting maxConnections and numThreads to the same value means that you're basically back to the one-thread-per-connection model. Even though the scheduler can give up a thread while keeping the connection around, no other work that could prevent the continuation with the suspended connection from starting immediately can run in the thread pool, because there are no connections left for that work.

So the situation where this deadlock can happen is that you have multiple DBIO actions running at the same time which contend over the same lock and have asynchronous computations while holding the lock (i.e. a1.andThen(a2).transactionally for two synchronous DBIO actions a1 and a2 is always safe but a1.flatMap(_ => a2).transactionally is not). The workaround is to raise numThreads to maxConnections. This creates some overhead for thread scheduling (both in the app server and the db) but it prevents the deadlock situation.
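
To make the contrast concrete, here is a sketch of the two shapes over a hypothetical items table: andThen stays fused on one DBIO thread, while the flatMap releases the thread between the two actions even though the transaction (and any row lock) stays open:

import scala.concurrent.ExecutionContext.Implicits.global
import slick.jdbc.PostgresProfile.api._

// Two synchronous actions touching the same (hypothetical) row.
val a1 = sqlu"UPDATE items SET name = 'a' WHERE id = 1"
val a2 = sqlu"UPDATE items SET name = 'b' WHERE id = 1"

// Safe: fused into one synchronous unit, so the DBIO thread is never given up mid-transaction.
val safe = (a1 andThen a2).transactionally

// Risky: the flatMap needs an ExecutionContext, so the thread is released between a1 and a2
// while the transaction holds its locks; with maxConnections > numThreads, enough of these in
// flight can leave no thread free to resume any of them.
val risky = a1.flatMap(_ => a2).transactionally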

@hvesalai
Member

hvesalai commented Mar 8, 2017

Au contraire. In PostgreSQL, streaming (the use of a cursor) requires a transaction.

@szeiger
Member

szeiger commented Mar 9, 2017

Ah yes, I remember needing a transaction to prevent caching of the whole result set. So I suppose without a transaction the result set gets fully materialized and there should never be a need for a lock.

@hvesalai
Member

hvesalai commented Mar 9, 2017

Exactly so, and hence if the result set is huge, you want a transaction with the lowest applicable isolation level to prevent running out of memory.
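
For reference, the usual Slick 3.x recipe for cursor-based streaming on PostgreSQL (hypothetical table and config key, placeholder fetch size) combines a statement-parameter override with .transactionally:

import slick.jdbc.PostgresProfile.api._
import slick.jdbc.{ResultSetConcurrency, ResultSetType}

// Hypothetical table and config key; only the statement parameters and the transaction matter.
class Items(tag: Tag) extends Table[String](tag, "items") {
  def name = column[String]("name")
  def * = name
}
val items = TableQuery[Items]
val db = Database.forConfig("mydb")

// Forward-only, read-only result set with a fetch size, inside a transaction: this is what
// makes the PostgreSQL driver stream via a cursor instead of materializing every row.
val publisher = db.stream(
  items.result
    .withStatementParameters(
      rsType = ResultSetType.ForwardOnly,
      rsConcurrency = ResultSetConcurrency.ReadOnly,
      fetchSize = 256)
    .transactionally
)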

@arobertn

arobertn commented May 3, 2017

Just a data point: we are experiencing the exact same issue as @dspasojevic described on Mar 8, and the explanation above makes perfect sense. The underlying problem is that, due to JDBC's architecture, a thread running a statement that is waiting for a lock is blocked.

@cvogt
Member

cvogt commented May 3, 2017

@arobertn, I might be wrong, but @dspasojevic describes having a larger connection pool than thread pool, meaning that yes, connections can run out of threads. We consider that a configuration bug in the application, which we'll hopefully make harder for people to implement at some point in the future. Meaning that if you increase the size of Slick's thread pool beyond the number of connections, you should be good. Am I missing something? @dspasojevic also mentioned an "idle transaction" being involved, but without more details I don't know the impact of that.

@hvesalai
Member

hvesalai commented Feb 8, 2018

min/maxThreads are defined in slick.util.AsyncExecutor#L50

@jontra

jontra commented Feb 8, 2018

@arobertn There are no configuration settings for max/min threads, only numThreads (see AsyncExecutor; it is always called with numThreads for both minThreads and maxThreads).

@jontra

jontra commented Feb 9, 2018

@hvesalai When we set numThreads == maxConnections we stopped seeing this exception, even without your patch (https://github.com/hvesalai/slick/commits/3.2.1-TMNOW).
I am still not sure why. Any ideas?

@hvesalai
Member

hvesalai commented Feb 9, 2018

@jontra the fix we are using only affects AndThen and Seq. If you are not using either, then that could be one reason.

hvesalai added a commit to hvesalai/slick that referenced this issue Mar 1, 2018
See discussion in slick#1614 about deadlocks caused by maxConnections > numThreads.
@hvesalai
Member

hvesalai commented Mar 1, 2018

I have created a series of PRs that most probably fix this or reveal bad configurations leading to it. Please take a look, everybody, and test if you can.

#1853
#1854
#1855

@stubbi

stubbi commented Apr 15, 2018

Unfortunately, we are still running into this issue after upgrading to Slick 3.2.2. 😞 Not running our queries with .transactionally solves the issue. We are not touching the default thread and connection settings.

@hvesalai
Member

@stubbi Can you open a new issue and describe all the relevant settings (maxThreads, minThreads, maxConnections)? Have you checked the logs to make sure there are no warnings about settings that could cause deadlocks?

@hvesalai
Member

Also note that this issue was about deadlocks in the Slick thread pool. If your deadlock happens somewhere else (e.g. in the DB), that is a completely separate issue. Slick cannot prevent deadlocks it has no control over, such as those caused by inappropriate use of DB-level locking (implicit or explicit).

@stubbi

stubbi commented Apr 15, 2018

@hvesalai I described our problem with more details in #1890

@jimm-porch

Although this is supposedly fixed in 3.3, I'm stuck on 3.2 until the java.time.Instant issue #2053 is fixed.

I was hopeful that configuring minThreads = maxThreads = maxConnections would solve the problem, but even after doing that I'm still seeing this exception with 3.2.3:

Exception in thread "CorePostgresDriver.AsyncExecutor-8" java.lang.IllegalArgumentException: requirement failed: count cannot be decreased
	at scala.Predef$.require(Predef.scala:224)
	at slick.util.ManagedArrayBlockingQueue$$anonfun$decreaseInUseCount$1.apply$mcV$sp(ManagedArrayBlockingQueue.scala:54)
	at slick.util.ManagedArrayBlockingQueue$$anonfun$decreaseInUseCount$1.apply(ManagedArrayBlockingQueue.scala:53)
	at slick.util.ManagedArrayBlockingQueue$$anonfun$decreaseInUseCount$1.apply(ManagedArrayBlockingQueue.scala:53)
	at slick.util.ManagedArrayBlockingQueue.locked(ManagedArrayBlockingQueue.scala:201)
	at slick.util.ManagedArrayBlockingQueue.decreaseInUseCount(ManagedArrayBlockingQueue.scala:53)
	at slick.util.AsyncExecutor$$anon$2$$anon$1.afterExecute(AsyncExecutor.scala:129)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1150)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)

Here's how I'm configuring my DatabaseDef. I'm passing a poolSize of 10, which is the same value I provide to Hikari as the max pool size:

import slick.util.AsyncExecutor

val asyncExecutor = AsyncExecutor(
  name = "CorePostgresDriver.AsyncExecutor",
  minThreads = poolSize,
  maxThreads = poolSize,
  maxConnections = poolSize,
  queueSize = 1000 // the default used by AsyncExecutor.default()
)
api.Database.forDataSource(dataSource, Some(poolSize), executor = asyncExecutor)

@jimm-porch

By coincidence, some of my coworkers encountered transactional issues with Slick today. Instead of seeing the "count cannot be decreased" errors like I did, they saw timeouts waiting on connections from the pool. It seemed to behave OK under light load, but under regular use it quickly hangs. In the DB itself we only see connections in the state "idle in transaction" and no obvious locks, as though commit were never invoked.

The AsyncExecutor was configured as I show above.

As far as they can tell, no DB-level locking is involved. For now they are removing transactionally from the code to get things working.

hhulkko added a commit to Opetushallitus/koski that referenced this issue Sep 15, 2019