Issue with spike of connection to database. Potential solution#9211
Issue with spike of connection to database. Potential solution#9211oherych wants to merge 1 commit intotemporalio:mainfrom
Conversation
|
Oleh Herych seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it. |
|
Hi @oherych, thanks for flagging this! Since you're running a load test, it's possible the database is returning errors that are forcing connection refreshes. If that's the case, your total connection count might still be under the configured limit. To confirm, could you check the persistence_session_refresh_attempts metric? If this metric is incrementing, that would indicate connection refreshes are happening and likely causing the issue you're seeing. Another thing to try out is setting maxIdleConns to a higher value. If temporal is deployed through helm chart, it sets the same value for maxIdleConns and maxConns: https://github.com/temporalio/helm-charts/blob/e4a3894a653a5ee8b5841dfb64f14804737531e6/charts/temporal/values/values.postgresql.yaml#L25 |
|
@prathyushpv, thank for your answer. Do you have information what this logic exists? What sense to have |
|
Hi @oherych, The problem you have noticed — closing excess connections down to maxIdleConns — is handled entirely by Go's standard database/sql package. When a connection is returned to the pool and the number of idle connections already exceeds maxIdleConns, Go automatically closes the excess one. So I don't think the problem you are facing is caused by refCnt logic. So this change may not fix that. |
|
@prathyushpv, thanks for your time. I found mistake in my process of investigation this issue. We randomly took few stuck traces from log and they all pointed to And you're absolutely right |
During load testing we found that we have too many short-lived (< 1 sec) connections to database. We tried to play with the configuration of the connection pool, but we found that Temporal just ignored them. It was the main surprise.
Then I found that we have
type DbConnthat has an inside counter. And when this counter is zero Temporal force to close all connections to the database. So, if nobody uses the database right now, we just close all connections to it.For me, is not clear we do this. During the investigation, I found that those changes came from Cadence in 2019. Zero commends why we need them and what problem we try to resolve. Also, I found a similar issue #6459.
So I see two potential way how to resolve it:
I prefer the first option, but I implemented the second. This is not production ready and well-tested code. I just want to hear opinions of maintainers.
What changed?
Describe what has changed in this PR.
Why?
Tell your future self why have you made these changes.
How did you test it?
Potential risks
Any change is risky. Identify all risks you are aware of. If none, remove this section.