We just put 0.0.24 into production and it's starting to get into a state where river is logging hundreds of times a second with:

```json
{
  "level": "error",
  "time": "2024-03-01T23:56:08.679305457Z",
  "notifier": {
    "err": {
      "error": "connection already established",
      "kind": "*errors.errorString",
      "stack": null
    }
  },
  "subsystem": "river",
  "message": "error establishing connection from pool"
}
```
via river/internal/notifier/notifier.go, line 130 in 035ba59:

```go
n.logger.Error("error establishing connection from pool", "err", err)
```

and river/riverdriver/riverpgxv5/river_pgx_v5_driver.go, line 491 in 035ba59:

```go
return errors.New("connection already established")
```
(My apologies for the lack of a stack trace)
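To make the suspected failure mode concrete, here is a minimal, hypothetical sketch of the state bug this looks like from the outside. None of these names are river's actual code; the assumption is that a failed `Close` (the TLS `closeNotify` / i/o timeout above) leaves an internal "established" flag set, so every subsequent connect attempt errors immediately and the retry loop spins:

```go
package main

import (
	"errors"
	"fmt"
)

// listener is a hypothetical stand-in for the notifier's connection state,
// not river's real type.
type listener struct {
	established bool
}

// Listen refuses to connect while the established flag is still set.
func (l *listener) Listen() error {
	if l.established {
		return errors.New("connection already established")
	}
	l.established = true
	return nil
}

// Close fails (e.g. the underlying TCP write times out) and, crucially in
// this sketch, never resets l.established.
func (l *listener) Close() error {
	return errors.New("i/o timeout")
}

func main() {
	l := &listener{}
	_ = l.Listen()         // initial connect succeeds
	fmt.Println(l.Close()) // close fails with the i/o timeout
	for i := 0; i < 3; i++ {
		fmt.Println(l.Listen()) // every retry now errors immediately
	}
}
```

If something like this is happening, each failed `Listen` returns instantly with no backoff, which would match the hundreds-per-second log volume we're seeing.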
This is thrashing CPU significantly (in our CPU graphs, the small bumps are the deployments that haven't rolled out 0.0.24 yet).
I managed to capture the lead-up to ONE of these bursts, and it looks like a TCP error on the PostgreSQL connection started the death spiral:

```json
{
  "level": "info",
  "time": "2024-03-01T23:56:07.097802512Z",
  "num_completed_jobs": 40,
  "num_jobs_running": 0,
  "queue": "default",
  "subsystem": "river",
  "message": "producer: Heartbeat"
}
{
  "level": "error",
  "time": "2024-03-01T23:56:08.679126682Z",
  "notifier": {
    "err": {
      "error": "tls: failed to send closeNotify alert (but connection was closed anyway): write tcp 10.122.48.181:44034->10.122.30.240:5432: i/o timeout",
      "kind": "*fmt.wrapError",
      "stack": null
    }
  },
  "subsystem": "river",
  "message": "error closing listener"
}
{
  "level": "error",
  "time": "2024-03-01T23:56:08.679305457Z",
  "notifier": {
    "err": {
      "error": "connection already established",
      "kind": "*errors.errorString",
      "stack": null
    }
  },
  "subsystem": "river",
  "message": "error establishing connection from pool"
}
```
I'm sorry I don't have anything more conclusive right now, but I wanted to at least get this in your hands in case something rings a bell.