We just put 0.0.24 into production and it's starting to get into a state where river is logging hundreds of times a second with:

```json
{
  "level": "error",
  "time": "2024-03-01T23:56:08.679305457Z",
  "notifier": {
    "err": {
      "error": "connection already established",
      "kind": "*errors.errorString",
      "stack": null
    }
  },
  "subsystem": "river",
  "message": "error establishing connection from pool"
}
```
via river/internal/notifier/notifier.go, line 130 in 035ba59:

```go
n.logger.Error("error establishing connection from pool", "err", err)
```

and river/riverdriver/riverpgxv5/river_pgx_v5_driver.go, line 491 in 035ba59:

```go
return errors.New("connection already established")
```
(My apologies for the lack of a stack trace)
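To make the suspected failure mode concrete, here is a minimal, hypothetical sketch of the state bug this looks like from the outside. None of these names are river's actual code; the assumption is that a failed `Close` (the TLS `closeNotify` / i/o timeout above) leaves an internal "established" flag set, so every subsequent connect attempt errors immediately and the retry loop spins:

```go
package main

import (
	"errors"
	"fmt"
)

// listener is a hypothetical stand-in for the notifier's connection state,
// not river's real type.
type listener struct {
	established bool
}

// Listen refuses to connect while the established flag is still set.
func (l *listener) Listen() error {
	if l.established {
		return errors.New("connection already established")
	}
	l.established = true
	return nil
}

// Close fails (e.g. the underlying TCP write times out) and, crucially in
// this sketch, never resets l.established.
func (l *listener) Close() error {
	return errors.New("i/o timeout")
}

func main() {
	l := &listener{}
	_ = l.Listen()         // initial connect succeeds
	fmt.Println(l.Close()) // close fails with the i/o timeout
	for i := 0; i < 3; i++ {
		fmt.Println(l.Listen()) // every retry now errors immediately
	}
}
```

If something like this is happening, each failed `Listen` returns instantly with no backoff, which would match the hundreds-per-second log volume we're seeing.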
This is thrashing CPU significantly (in our CPU graphs, the small bumps are the deployments that haven't rolled out 0.0.24 yet).
I managed to capture the lead-up to ONE of these bursts, and it looks like a TCP error on the PostgreSQL connection started the death spiral:

```json
{
  "level": "info",
  "time": "2024-03-01T23:56:07.097802512Z",
  "num_completed_jobs": 40,
  "num_jobs_running": 0,
  "queue": "default",
  "subsystem": "river",
  "message": "producer: Heartbeat"
}
{
  "level": "error",
  "time": "2024-03-01T23:56:08.679126682Z",
  "notifier": {
    "err": {
      "error": "tls: failed to send closeNotify alert (but connection was closed anyway): write tcp 10.122.48.181:44034->10.122.30.240:5432: i/o timeout",
      "kind": "*fmt.wrapError",
      "stack": null
    }
  },
  "subsystem": "river",
  "message": "error closing listener"
}
{
  "level": "error",
  "time": "2024-03-01T23:56:08.679305457Z",
  "notifier": {
    "err": {
      "error": "connection already established",
      "kind": "*errors.errorString",
      "stack": null
    }
  },
  "subsystem": "river",
  "message": "error establishing connection from pool"
}
```
I'm sorry I don't have anything more conclusive right now, but I wanted to at least get this in your hands in case something rings a bell.