Intermittent test failure: Test_Client_InsertTriggersImmediateWork #213

Closed
brandur opened this issue Feb 19, 2024 · 4 comments

Comments

brandur (Contributor) commented Feb 19, 2024

Noticed this one in #212, but since it's not reproducible and I've definitely seen it from time to time before, I don't believe it's related to my change. Even with local runs of 5000 iterations with `-race`, I'm unable to reproduce it.

Sample failure:

https://github.com/riverqueue/river/actions/runs/7955682330/job/21715052284?pr=212

Run go test -p 1 -race ./...
--- FAIL: Test_Client_InsertTriggersImmediateWork (5.14s)
    logger.go:225: time=2024-02-19T06:52:27.013Z level=INFO msg="River client successfully started" client_id=01HQ026AZWE2GNHS5GK2[8](https://github.com/riverqueue/river/actions/runs/7955682330/job/21715052284?pr=212#step:8:9)383[9](https://github.com/riverqueue/river/actions/runs/7955682330/job/21715052284?pr=212#step:8:10)P
    logger.go:225: time=2024-02-19T06:52:27.013Z level=INFO msg="Election change received" is_leader=false
    logger.go:225: time=2024-02-19T06:52:27.013Z level=INFO msg="producer: Producer started" queue=default
    logger.go:225: time=2024-02-19T06:52:27.013Z level=INFO msg="producer: Run loop started"
    logger.go:225: time=2024-02-19T06:52:27.106Z level=INFO msg="Election change received" is_leader=true
    logger.go:225: time=2024-02-19T06:52:27.106Z level=INFO msg="PeriodicJobEnqueuer: Run loop started"
    logger.go:225: time=2024-02-19T06:52:27.106Z level=INFO msg="Reindexer: Run loop started"
    logger.go:225: time=2024-02-19T06:52:27.106Z level=INFO msg="Scheduler: Run loop started"
    logger.go:225: time=2024-02-19T06:52:27.106Z level=INFO msg="JobCleaner: Run loop started"
    logger.go:225: time=2024-02-19T06:52:27.106Z level=INFO msg="Rescuer: Run loop started"
    logger.go:225: time=2024-02-19T06:52:27.109Z level=INFO msg="Rescuer: Ran successfully" num_jobs_discarded=0 num_jobs_retry_scheduled=0
    logger.go:225: time=2024-02-19T06:52:27.111Z level=INFO msg="JobCleaner: Ran successfully" num_jobs_deleted=0
    logger.go:225: time=2024-02-19T06:52:27.134Z level=INFO msg="Scheduler: Ran successfully" num_jobs_scheduled=0
    logger.go:225: time=2024-02-19T06:52:27.611Z level=INFO msg="Scheduler: Ran successfully" num_jobs_scheduled=0
    logger.go:225: time=2024-02-19T06:52:28.110Z level=INFO msg="Scheduler: Ran successfully" num_jobs_scheduled=0
    logger.go:225: time=2024-02-19T06:52:28.609Z level=INFO msg="Scheduler: Ran successfully" num_jobs_scheduled=0
    logger.go:225: time=2024-02-19T06:52:29.110Z level=INFO msg="Scheduler: Ran successfully" num_jobs_scheduled=0
    logger.go:225: time=2024-02-19T06:52:29.610Z level=INFO msg="Scheduler: Ran successfully" num_jobs_scheduled=0
    logger.go:225: time=2024-02-19T06:52:30.110Z level=INFO msg="Scheduler: Ran successfully" num_jobs_scheduled=0
    logger.go:225: time=2024-02-19T06:52:30.611Z level=INFO msg="Scheduler: Ran successfully" num_jobs_scheduled=0
    logger.go:225: time=2024-02-19T06:52:31.109Z level=INFO msg="Scheduler: Ran successfully" num_jobs_scheduled=0
    logger.go:225: time=2024-02-19T06:52:31.609Z level=INFO msg="Scheduler: Ran successfully" num_jobs_scheduled=0
    logger.go:225: time=2024-02-19T06:52:32.014Z level=INFO msg="Client: Job stats (since last stats line)" num_jobs_run=1 average_complete_duration=83.159736ms average_queue_wait_duration=67.223225ms average_run_duration=58.96µs
    logger.go:225: time=2024-02-19T06:52:32.015Z level=INFO msg="producer: Heartbeat" num_completed_jobs=1 num_jobs_running=0 queue=default
    client_test.go:2462: timed out waiting for 2nd job to start
    logger.go:225: time=2024-02-19T06:52:32.098Z level=INFO msg="Client: Stop started"
    logger.go:225: time=2024-02-19T06:52:32.098Z level=INFO msg="producer: Run loop stopped"
    logger.go:225: time=2024-02-19T06:52:32.098Z level=INFO msg="Election change received" is_leader=false
    logger.go:225: time=2024-02-19T06:52:32.098Z level=INFO msg="producer: Producer stopped" queue=default num_completed_jobs=1
    logger.go:225: time=2024-02-19T06:52:32.098Z level=INFO msg="PeriodicJobEnqueuer: Run loop stopped"
    logger.go:225: time=2024-02-19T06:52:32.098Z level=INFO msg="Rescuer: Run loop stopped"
    logger.go:225: time=2024-02-19T06:52:32.099Z level=INFO msg="Scheduler: Run loop stopped"
    logger.go:225: time=2024-02-19T06:52:32.099Z level=INFO msg="Reindexer: Run loop stopped"
    logger.go:225: time=2024-02-19T06:52:32.099Z level=INFO msg="JobCleaner: Run loop stopped"
    logger.go:225: time=2024-02-19T06:52:32.100Z level=INFO msg="Client: All services stopped"
    logger.go:225: time=2024-02-19T06:52:32.100Z level=INFO msg="Client: Stop complete"
FAIL
FAIL	github.com/riverqueue/river	18.035s
bgentry pushed a commit that referenced this issue Feb 19, 2024
…test (#214)

Some very, very small logging improvements that log when the notifier
has started and stopped, like we already have for many other services.
This is an attempt to help diagnose a future CI failure of #213, since
I'm completely unable to reproduce this locally regardless of iteration
count.

Also, augment a log line in the client so it's prefixed with its service
name and contains a `client_id`, which is useful here because it helps
indicate which client ID was elected leader from logs alone.
bgentry (Contributor) commented Feb 19, 2024

I was able to get one failure of this locally with `go test -race -run 'Test_Client_InsertTriggersImmediateWork' -count 5000`, but not with the new extra logging in place. At least it's not impossible to reproduce locally, just rare.

There are some odd things about the ordering of operations in this test that I'm looking into.

bgentry (Contributor) commented Feb 22, 2024

I've got some more info on this flaky test. I added more logging in the test callback function, as well as in the `waitForClientHealthy` routine. I've seen three different failure scenarios so far:

  1. For one of the failures, the notifier never comes up as healthy; it's just stuck in initializing. I'm guessing the connect attempt occasionally hangs, and we don't have any specific timeout to prevent it from hanging indefinitely (if no context timeout or pgx `ConnectTimeout` is set).

     In my local testing, I believe I'm able to fully resolve this issue by setting a tight 1-second context timeout on the notifier's establishConn. However, a hardcoded short timeout like that isn't viable outside of dev/tests. Maybe we should just set the `ConnectTimeout` on the `pgconn.Config` for all our tests (see the sketch after this list)? This also feels like it should be documented in a "Postgres best practices" doc page, or included in some kind of automated "is my pool configured sanely for prod workloads" check.

     IMO it also highlights the importance of further developing our client health API. The client monitor is the only way to debug this stuff right now, and it's not exposed outside of the river package.

  2. The 2nd scenario is more like the one you shared above. The notifier has started successfully, and yet the 2nd job does not get worked during the test: there's basically a 5-second gap between the 1st job running and the client getting shut down because the test context has ended. Still trying to understand the cause here.

  3. I ended up with a 3rd failure on my latest run. It's a variation of (2), except that after the main test failed, there was an additional panic due to logging after the test ended. It seems we were trying to emit a resignation message on a pool that was already closed:

    panic: Log in goroutine after Test_Client_InsertTriggersImmediateWork has completed: time=2024-02-22T09:02:01.816-06:00 level=ERROR msg="error attempting to resign" elector.err="closed pool"
    

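As a rough sketch of what (1) could look like, assuming the tests build their pool from a database URL via `pgxpool.ParseConfig`; the `testPool` helper and its `dbURL` parameter here are hypothetical, not river's actual test harness:

```go
package testutil

import (
	"context"
	"testing"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

// testPool builds a pgx pool whose connection attempts fail fast instead of
// hanging indefinitely, so a stuck connect surfaces as an error rather than
// leaving the notifier stuck in "initializing".
//
// Hypothetical helper: river's real tests may construct their pool elsewhere.
func testPool(ctx context.Context, tb testing.TB, dbURL string) *pgxpool.Pool {
	tb.Helper()

	config, err := pgxpool.ParseConfig(dbURL)
	if err != nil {
		tb.Fatalf("error parsing pool config: %s", err)
	}

	// ConnectTimeout lives on the embedded pgconn.Config. A tight value is
	// fine for tests; production pools would likely want something larger.
	config.ConnConfig.ConnectTimeout = 5 * time.Second

	pool, err := pgxpool.NewWithConfig(ctx, config)
	if err != nil {
		tb.Fatalf("error creating pool: %s", err)
	}
	tb.Cleanup(pool.Close)

	return pool
}
```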
brandur (Contributor, Author) commented Feb 22, 2024

Nice investigating. Regarding (1):

  • As discussed on chat, it seems plausible to try decreasing `ConnectTimeout` for tests.
  • The fact that the test seems to start anyway also suggests a problem somewhere in `waitForClientHealthy` and/or the client monitoring infrastructure.

Another strategy we can try here, if the connect timeout doesn't fix it, is to work our way in from the edges and put more test coverage (including stress tests) on the various components like the notifier, then the producer. There are almost certainly bugs and rough edges in there, and this will help tease them out and hopefully improve the overall resilience of `Client`.
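If we go the stress-test route, the per-component pattern could be as simple as a start/stop loop run under `-race`. The `startStopper` interface below is a hypothetical stand-in for whatever the notifier and producer actually expose, not river's real internal API:

```go
package testutil

import (
	"context"
	"testing"
	"time"
)

// startStopper is a stand-in for the rough shape of river's internal
// services (notifier, producer, etc.). Hypothetical interface, not the
// real one.
type startStopper interface {
	Start(ctx context.Context) error
	Stop()
}

// stressStartStop starts and stops a freshly constructed service many times
// to shake out races in its startup/shutdown paths. Run with `go test -race`.
func stressStartStop(t *testing.T, newService func() startStopper) {
	t.Helper()

	for i := 0; i < 100; i++ {
		svc := newService()

		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		if err := svc.Start(ctx); err != nil {
			cancel()
			t.Fatalf("iteration %d: error starting service: %s", i, err)
		}

		svc.Stop()
		cancel()
	}
}
```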

brandur (Contributor, Author) commented Mar 12, 2024

Haven't seen this one in days. Definitively fixed by #253.

brandur closed this as completed Mar 12, 2024