Intermittent test failure: Test_Client_InsertTriggersImmediateWork #213

Closed
brandur opened this issue Feb 19, 2024 · 4 comments

Comments

brandur (Contributor) commented Feb 19, 2024

Noticed this one in #212, but since it's not reproducible and I've definitely seen it from time to time before, I don't believe it's related to my change. Even with local runs of 5000 iterations with `-race`, I'm unable to reproduce it.

Sample failure:

https://github.com/riverqueue/river/actions/runs/7955682330/job/21715052284?pr=212

Run go test -p 1 -race ./...
--- FAIL: Test_Client_InsertTriggersImmediateWork (5.14s)
    logger.go:225: time=2024-02-19T06:52:27.013Z level=INFO msg="River client successfully started" client_id=01HQ026AZWE2GNHS5GK2[8](https://github.com/riverqueue/river/actions/runs/7955682330/job/21715052284?pr=212#step:8:9)383[9](https://github.com/riverqueue/river/actions/runs/7955682330/job/21715052284?pr=212#step:8:10)P
    logger.go:225: time=2024-02-19T06:52:27.013Z level=INFO msg="Election change received" is_leader=false
    logger.go:225: time=2024-02-19T06:52:27.013Z level=INFO msg="producer: Producer started" queue=default
    logger.go:225: time=2024-02-19T06:52:27.013Z level=INFO msg="producer: Run loop started"
    logger.go:225: time=2024-02-19T06:52:27.106Z level=INFO msg="Election change received" is_leader=true
    logger.go:225: time=2024-02-19T06:52:27.106Z level=INFO msg="PeriodicJobEnqueuer: Run loop started"
    logger.go:225: time=2024-02-19T06:52:27.106Z level=INFO msg="Reindexer: Run loop started"
    logger.go:225: time=2024-02-19T06:52:27.106Z level=INFO msg="Scheduler: Run loop started"
    logger.go:225: time=2024-02-19T06:52:27.106Z level=INFO msg="JobCleaner: Run loop started"
    logger.go:225: time=2024-02-19T06:52:27.106Z level=INFO msg="Rescuer: Run loop started"
    logger.go:225: time=2024-02-19T06:52:27.109Z level=INFO msg="Rescuer: Ran successfully" num_jobs_discarded=0 num_jobs_retry_scheduled=0
    logger.go:225: time=2024-02-19T06:52:27.111Z level=INFO msg="JobCleaner: Ran successfully" num_jobs_deleted=0
    logger.go:225: time=2024-02-19T06:52:27.134Z level=INFO msg="Scheduler: Ran successfully" num_jobs_scheduled=0
    logger.go:225: time=2024-02-19T06:52:27.611Z level=INFO msg="Scheduler: Ran successfully" num_jobs_scheduled=0
    logger.go:225: time=2024-02-19T06:52:28.110Z level=INFO msg="Scheduler: Ran successfully" num_jobs_scheduled=0
    logger.go:225: time=2024-02-19T06:52:28.609Z level=INFO msg="Scheduler: Ran successfully" num_jobs_scheduled=0
    logger.go:225: time=2024-02-19T06:52:29.110Z level=INFO msg="Scheduler: Ran successfully" num_jobs_scheduled=0
    logger.go:225: time=2024-02-19T06:52:29.610Z level=INFO msg="Scheduler: Ran successfully" num_jobs_scheduled=0
    logger.go:225: time=2024-02-19T06:52:30.110Z level=INFO msg="Scheduler: Ran successfully" num_jobs_scheduled=0
    logger.go:225: time=2024-02-19T06:52:30.611Z level=INFO msg="Scheduler: Ran successfully" num_jobs_scheduled=0
    logger.go:225: time=2024-02-19T06:52:31.109Z level=INFO msg="Scheduler: Ran successfully" num_jobs_scheduled=0
    logger.go:225: time=2024-02-19T06:52:31.609Z level=INFO msg="Scheduler: Ran successfully" num_jobs_scheduled=0
    logger.go:225: time=2024-02-19T06:52:32.014Z level=INFO msg="Client: Job stats (since last stats line)" num_jobs_run=1 average_complete_duration=83.159736ms average_queue_wait_duration=67.223225ms average_run_duration=58.96µs
    logger.go:225: time=2024-02-19T06:52:32.015Z level=INFO msg="producer: Heartbeat" num_completed_jobs=1 num_jobs_running=0 queue=default
    client_test.go:2462: timed out waiting for 2nd job to start
    logger.go:225: time=2024-02-19T06:52:32.098Z level=INFO msg="Client: Stop started"
    logger.go:225: time=2024-02-19T06:52:32.098Z level=INFO msg="producer: Run loop stopped"
    logger.go:225: time=2024-02-19T06:52:32.098Z level=INFO msg="Election change received" is_leader=false
    logger.go:225: time=2024-02-19T06:52:32.098Z level=INFO msg="producer: Producer stopped" queue=default num_completed_jobs=1
    logger.go:225: time=2024-02-19T06:52:32.098Z level=INFO msg="PeriodicJobEnqueuer: Run loop stopped"
    logger.go:225: time=2024-02-19T06:52:32.098Z level=INFO msg="Rescuer: Run loop stopped"
    logger.go:225: time=2024-02-19T06:52:32.099Z level=INFO msg="Scheduler: Run loop stopped"
    logger.go:225: time=2024-02-19T06:52:32.099Z level=INFO msg="Reindexer: Run loop stopped"
    logger.go:225: time=2024-02-19T06:52:32.099Z level=INFO msg="JobCleaner: Run loop stopped"
    logger.go:225: time=2024-02-19T06:52:32.100Z level=INFO msg="Client: All services stopped"
    logger.go:225: time=2024-02-19T06:52:32.100Z level=INFO msg="Client: Stop complete"
FAIL
FAIL	github.com/riverqueue/river	18.035s
bgentry pushed a commit that referenced this issue Feb 19, 2024
…test (#214)

Some very, very small logging improvements that log when the notifier
has started and stopped, like we already have for many other services.
This is an attempt to help diagnose a future CI failure of #213, since
I'm completely unable to reproduce this locally regardless of iteration
count.

Also, augment a log line in the client so it's prefixed with its service
name and contains a `client_id`, which is useful here because it helps
indicate which client ID was elected leader from logs alone.
bgentry (Contributor) commented Feb 19, 2024

I was able to get one failure of this locally with `go test -race -run 'Test_Client_InsertTriggersImmediateWork' -count 5000`, but not with the new extra logging in place. At least it's not impossible to reproduce locally, just rare.

There are some odd things about the ordering of operations in this test that I'm looking into.

bgentry (Contributor) commented Feb 22, 2024

I've got some more info on this flaky test. I added more logging in the test callback function, as well as in the `waitForClientHealthy` routine. I've seen three different failure scenarios so far:

  1. For one of the failures, the notifier never comes up as healthy; it's just stuck in initializing. I'm guessing the connect attempt occasionally hangs, and we don't have any specific timeout to prevent it from hanging indefinitely (if no context timeout or pgx `ConnectTimeout` is set).

     In my local testing, I believe I'm able to fully resolve this issue by setting a tight 1-second context timeout on the notifier's establishConn. However, a hardcoded short timeout like that isn't viable outside of dev/tests. Maybe we should just set the `ConnectTimeout` on the `pgconn.Config` for all our tests (see the sketch after this list)? This also feels like it should be documented in a "Postgres best practices" doc page, or included in some kind of automated "is my pool configured sanely for prod workloads" check.

     IMO it also highlights the importance of further developing our client health API. The client monitor is the only way to debug this stuff right now, and it's not exposed outside of the river package.

  2. The 2nd scenario is more like the one you shared above. The notifier has started successfully, and yet the 2nd job does not get worked during the test: there's basically a 5-second gap between the 1st job running and the client getting shut down because the test context has ended. Still trying to understand the cause here.

  3. I ended up with a 3rd failure on my latest run. It's a variation of (2), except that after the main test failed, there was an additional panic due to logging after the test ended. It seems we were trying to emit a resignation message on a pool that was already closed:

    panic: Log in goroutine after Test_Client_InsertTriggersImmediateWork has completed: time=2024-02-22T09:02:01.816-06:00 level=ERROR msg="error attempting to resign" elector.err="closed pool"
    

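As a rough sketch of what (1) could look like, assuming the tests build their pool from a database URL via `pgxpool.ParseConfig`; the `testPool` helper and its `dbURL` parameter here are hypothetical, not river's actual test harness:

```go
package testutil

import (
	"context"
	"testing"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

// testPool builds a pgx pool whose connection attempts fail fast instead of
// hanging indefinitely, so a stuck connect surfaces as an error rather than
// leaving the notifier stuck in "initializing".
//
// Hypothetical helper: river's real tests may construct their pool elsewhere.
func testPool(ctx context.Context, tb testing.TB, dbURL string) *pgxpool.Pool {
	tb.Helper()

	config, err := pgxpool.ParseConfig(dbURL)
	if err != nil {
		tb.Fatalf("error parsing pool config: %s", err)
	}

	// ConnectTimeout lives on the embedded pgconn.Config. A tight value is
	// fine for tests; production pools would likely want something larger.
	config.ConnConfig.ConnectTimeout = 5 * time.Second

	pool, err := pgxpool.NewWithConfig(ctx, config)
	if err != nil {
		tb.Fatalf("error creating pool: %s", err)
	}
	tb.Cleanup(pool.Close)

	return pool
}
```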
brandur (Contributor, Author) commented Feb 22, 2024

Nice investigating. Regarding (1):

  • As discussed on chat, it seems plausible to try decreasing `ConnectTimeout` for tests.
  • The fact that the test seems to start anyway also suggests a problem somewhere in `waitForClientHealthy` and/or the client monitoring infrastructure.

Another strategy we can try here, if the connect timeout doesn't fix it, is to work our way in from the edges and put more test coverage (including stress tests) on the various components like the notifier, then the producer. There are almost certainly bugs and rough edges in there, and this will help tease them out and hopefully improve the overall resilience of `Client`.
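If we go the stress-test route, the per-component pattern could be as simple as a start/stop loop run under `-race`. The `startStopper` interface below is a hypothetical stand-in for whatever the notifier and producer actually expose, not river's real internal API:

```go
package testutil

import (
	"context"
	"testing"
	"time"
)

// startStopper is a stand-in for the rough shape of river's internal
// services (notifier, producer, etc.). Hypothetical interface, not the
// real one.
type startStopper interface {
	Start(ctx context.Context) error
	Stop()
}

// stressStartStop starts and stops a freshly constructed service many times
// to shake out races in its startup/shutdown paths. Run with `go test -race`.
func stressStartStop(t *testing.T, newService func() startStopper) {
	t.Helper()

	for i := 0; i < 100; i++ {
		svc := newService()

		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		if err := svc.Start(ctx); err != nil {
			cancel()
			t.Fatalf("iteration %d: error starting service: %s", i, err)
		}

		svc.Stop()
		cancel()
	}
}
```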

brandur (Contributor, Author) commented Mar 12, 2024

Haven't seen this one in days. Definitively fixed by #253.

brandur closed this as completed Mar 12, 2024