Issue with spike of connection to database. Potential solution#9211

Closed
oherych wants to merge 1 commit into temporalio:main from oherych:added_delay_to_close_DbConn

Conversation


@oherych oherych commented Feb 4, 2026

During load testing we found that we had too many short-lived (< 1 sec) connections to the database. We tried tuning the connection-pool configuration, but Temporal simply ignored those settings. That was the main surprise.

Then I found the type DbConn, which keeps an internal reference counter. When this counter drops to zero, Temporal force-closes all connections to the database. So if nobody is using the database at that moment, we close every connection to it.

It is not clear to me why we do this. During the investigation I found that this logic came from Cadence in 2019, with zero comments explaining why it is needed or what problem it solves. I also found a similar issue: #6459.
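As far as I can tell, the mechanism behaves roughly like the following sketch (the names are illustrative, not Temporal's actual code): a shared handle whose underlying pool is torn down the instant the last user releases it.

```go
package main

import (
	"fmt"
	"sync"
)

// refCountedConn mimics the DbConn behavior described above (illustrative
// names, not the real types): the pool is (re)opened on the first Get and
// force-closed as soon as the reference count returns to zero.
type refCountedConn struct {
	mu     sync.Mutex
	refCnt int
	open   bool
	closes int // how many times the underlying pool was torn down
}

func (c *refCountedConn) Get() {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.refCnt == 0 {
		c.open = true // first user: (re)open the underlying pool
	}
	c.refCnt++
}

func (c *refCountedConn) Close() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.refCnt--
	if c.refCnt == 0 {
		c.open = false // last reference gone: force-close every connection
		c.closes++
	}
}

func main() {
	c := &refCountedConn{}
	c.Get()
	c.Get()
	c.Close() // one user left, pool stays open
	fmt.Println("open after first Close:", c.open)
	c.Close() // count hits zero, everything is torn down
	fmt.Println("open after last Close:", c.open, "teardowns:", c.closes)
}
```

With traffic that briefly drops to zero, every quiet moment triggers a full teardown, and the next request pays the cost of reconnecting, which matches the short-lived-connection spikes we observed.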

So I see two potential ways to resolve it:

  1. Remove the counter entirely. Keeping all connections open is cheaper.
  2. Add a delay before closing, which should reduce the spikes.

I prefer the first option, but I implemented the second. This is not production-ready, well-tested code; I just want to hear the maintainers' opinions.

What changed?

Describe what has changed in this PR.

Why?

Tell your future self why have you made these changes.

How did you test it?

  • built
  • run locally and tested manually
  • covered by existing tests
  • added new unit test(s)
  • added new functional test(s)

Potential risks

Any change is risky. Identify all risks you are aware of. If none, remove this section.

@oherych oherych requested review from a team as code owners February 4, 2026 12:43

prathyushpv commented Feb 5, 2026

Hi @oherych, thanks for flagging this! Since you're running a load test, it's possible the database is returning errors that are forcing connection refreshes. If that's the case, your total connection count might still be under the configured limit.

To confirm, could you check the persistence_session_refresh_attempts metric? If this metric is incrementing, that would indicate connection refreshes are happening and likely causing the issue you're seeing.

Another thing to try is setting maxIdleConns to a higher value. If Temporal is deployed through the Helm chart, it sets the same value for maxIdleConns and maxConns: https://github.com/temporalio/helm-charts/blob/e4a3894a653a5ee8b5841dfb64f14804737531e6/charts/temporal/values/values.postgresql.yaml#L25
You can check whether that is true in your deployment and set maxIdleConns and maxConns to the same value.
If maxIdleConns is much lower than maxConns, it can create this churn.
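For reference, these settings map onto Go's standard database/sql pool knobs. A sketch only (the driver name and DSN are placeholders; a real deployment would register an actual driver such as a Postgres one):

```go
package main

import (
	"database/sql"
	"fmt"
	"time"
	// In a real setup you would also import a driver here,
	// which registers itself with database/sql.
)

func main() {
	// Placeholder driver/DSN: sql.Open only validates its arguments,
	// so without a registered driver this returns an error.
	db, err := sql.Open("postgres", "postgres://user:pass@localhost/temporal")
	if err != nil {
		fmt.Println("driver not registered in this sketch:", err)
		return
	}
	db.SetMaxOpenConns(20)                 // maxConns: hard cap on open connections
	db.SetMaxIdleConns(20)                 // match maxConns so returned connections stay pooled
	db.SetConnMaxIdleTime(5 * time.Minute) // optionally age out long-idle connections
}
```

When SetMaxIdleConns is well below SetMaxOpenConns, every connection returned to an already-full idle pool is closed outright, which produces exactly the short-lived-connection churn described above.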


oherych commented Feb 6, 2026

@prathyushpv, thanks for your answer.

Do you have any information on why this logic exists? What is the purpose of refCnt, and why close all connections?

@prathyushpv

Hi @oherych,
The refCnt in DbConn is not related to managing idle connections. It's a resource-lifecycle mechanism for the shared sql.DB connection pool, as you noted in the description. I also don't have background on this mechanism. This is the PR that added the logic in Cadence: cadence-workflow/cadence#1808
We could follow the approach of keeping all connections open until the persistence factory is closed (suggestion 1), but I wonder if that could leak connections in tests that create and delete persistence stores (I have to check and confirm this).

The behavior you noticed (closing excess connections down to maxIdleConns) is handled entirely by Go's standard database/sql package. When a connection is returned to the pool and the number of idle connections already exceeds maxIdleConns, Go automatically closes the excess one. So I don't think the problem you are facing is caused by the refCnt logic, and this change may not fix it.


oherych commented Feb 12, 2026

@prathyushpv, thanks for your time. I found a mistake in my investigation of this issue. We randomly sampled a few stack traces from the logs, and they all pointed to the Close() method. But when I reproduced the issue in my local environment, I found that connections are only being closed in the connection pool; we don't close connections in DbConn that often.

@@@ [0x14000520540].NewRefCountedDBConn()
@@@ [0x14000dd4bd8].Get() 0 
@@@ [0x14000dd4bd8].Close() 0 
@@@ [0x14000dd4bd8].ForceClose() 0 
@@@ [0x14001200720].NewRefCountedDBConn()
@@@ [0x14000ae6cf8].Get() 0 
@@@ [0x14001201920].NewRefCountedDBConn()
@@@ [0x140012018c0].Get() 0 
@@@ [0x1400101a420].NewRefCountedDBConn()
@@@ [0x14000dd5cb8].Get() 0 
@@@ [0x140012833e0].NewRefCountedDBConn()
@@@ [0x14001283380].Get() 0 
@@@ [0x14000e6fc80].NewRefCountedDBConn()
@@@ [0x1400105c638].Get() 0 
@@@ [0x14001200180].NewRefCountedDBConn()
@@@ [0x14001200120].Get() 0 
@@@ [0x14001035620].NewRefCountedDBConn()
@@@ [0x14000dd5838].Get() 0 
@@@ [0x14000520c00].NewRefCountedDBConn()
@@@ [0x14000520ba0].Get() 0 
@@@ [0x14001230c00].NewRefCountedDBConn()
@@@ [0x14001450cf8].Get() 0 
@@@ [0x14001450cf8].Close() 0 
@@@ [0x14001450cf8].ForceClose() 0 

And you're absolutely right: setting maxConns == maxIdleConns fixed our problem. We had tried this solution before, but the earlier investigation confused me; I wanted to be sure I clearly understood why we need this counter.
