
🔨 Adding a job counter to address Semaphore issues #408

Merged
2 commits merged into samuelcolvin:main on Oct 30, 2023

Conversation

@rm-21 (Contributor) commented on Jun 10, 2023

Issue #405
Acquiring the semaphore on the same event loop while the maximum number of jobs is already running blocks the heartbeat and cancellation tasks.

Possible solution

  • Keep the semaphore capacity at max_jobs + 1
  • Have a job_counter that increments when the semaphore is acquired and decrements when it is released
  • If job_counter == max_jobs, release the semaphore immediately and return from the start_jobs function, i.e. forgo starting new jobs (see the sketch below)
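
Roughly, the idea inside start_jobs looks like this (a simplified sketch of the approach rather than the exact diff; _job_finished here is just a placeholder name for the done callback attached to each job task):

async def start_jobs(self, job_ids):
    for job_id in job_ids:
        await self.sem.acquire()

        if self.job_counter >= self.max_jobs:
            # all job slots are taken: give the slot back and skip starting
            # new jobs, so the poll loop can still run heartbeat and
            # cancellation checks
            self.sem.release()
            return

        self.job_counter += 1
        task = self.loop.create_task(self.run_job(job_id))
        task.add_done_callback(self._job_finished)

def _job_finished(self, task):
    # decrement before releasing, so the counter stays consistent with the
    # semaphore
    self.job_counter -= 1
    self.sem.release()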

The solution seems to work. The counter is thread-safe because we always increment or decrement it before releasing the semaphore. Quick test app below:

  • Define the worker in arq_app.py
  • Start 4 workers in different processes from start_workers.py
  • Schedule more jobs than max_jobs for each worker from arq_app.py
  • Check worker.log for the first heartbeat
  • Cancel a task from cancellation.py
  • Check worker.log for the latest heartbeat + cancellation + new job
# arq_app.py
import asyncio
import datetime as dt
from typing import Any

from arq.connections import RedisSettings
from arq import create_pool

REDIS = RedisSettings(host="127.0.0.1", port=6379, database=1)

async def long_task(ctx: dict[Any, Any], num: int, sleeper: float):
    print(f"[{dt.datetime.now()}] {num}: Task entered & started...")

    # Mock call
    await asyncio.sleep(1)
    
    # Fail randomly
    # import random
    # if random.random() > 0.5:
    #     raise RuntimeError(f"[{dt.datetime.now()}] {num}: Task randomly failed...")
    
    # Balance task
    await asyncio.sleep(sleeper)

    print(f"[{dt.datetime.now()}] {num}: Task finished...")



def worker_dispatcher(worker_queue_name: str):
    class ArqWorker:
        # Functions to schedule
        functions = [long_task]

        # ARQ settings
        redis_settings = REDIS
        max_jobs = 5
        job_timeout = 9 * 60 * 60 # 9hrs
        keep_result = 9 * 60 * 60 # 9hrs
        health_check_interval = 5
        retry_jobs = False
        allow_abort_jobs = True

        queue_name = worker_queue_name

    return ArqWorker


async def main():
    redis = await create_pool(settings_=REDIS)

    for idx in range(32):
        await redis.enqueue_job("long_task", idx, 6 * 60 * 60, _queue_name=f"queue{idx % 4}")


if __name__ == "__main__":
    asyncio.run(main())
# start_workers.py
import multiprocessing as mp
import logging

from arq import run_worker
from arq_app import worker_dispatcher  # absolute import: this script is run directly, not as a package

def start_worker(queue_name: str):
    logging.basicConfig(filename='worker.log', level=logging.INFO,
                        format='%(asctime)s [%(process)d] %(levelname)s: %(message)s')
    logging.info(f'Starting worker for {queue_name}')
    run_worker(worker_dispatcher(queue_name))

if __name__ == "__main__":
    for i in range(4):
        mp.Process(target=start_worker, args=(f"queue{i}",)).start()
# cancellation.py
import asyncio

from arq import create_pool
from arq.jobs import Job
from arq.connections import RedisSettings

REDIS = RedisSettings(host="127.0.0.1", port=6379, database=1)

async def job_status(task_id: str, stop: bool = False) -> None:
    job: Job = Job(
        job_id=task_id,
        redis=await create_pool(settings_=REDIS),
    )

    print(f"ID {task_id}: {await job.status()}")

    if stop:
        result: bool = await job.abort()
        print(f"ID {task_id}: Aborted: {result}")


async def main() -> None:
    await job_status(task_id="258a1b3ec3a84913a42d992b970841da", stop=True)


if __name__ == "__main__":
    asyncio.run(main())
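
Note: the task_id in cancellation.py is hard-coded for the example. In practice it comes from the Job object that enqueue_job returns, e.g. in arq_app.py's main():

    # enqueue_job returns an arq Job (or None if a job with the same id
    # already exists); its job_id is the id that cancellation.py needs
    job = await redis.enqueue_job("long_task", idx, 6 * 60 * 60, _queue_name=f"queue{idx % 4}")
    if job is not None:
        print(f"queued {job.job_id}")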

rm-21 marked this pull request as draft on June 10, 2023, 14:03
codecov bot commented on Jun 10, 2023

Codecov Report

Merging #408 (787bb47) into main (9109c2e) will decrease coverage by 0.27%.
The diff coverage is 80.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #408      +/-   ##
==========================================
- Coverage   98.66%   98.40%   -0.27%     
==========================================
  Files          11       11              
  Lines        1052     1063      +11     
  Branches      199      200       +1     
==========================================
+ Hits         1038     1046       +8     
- Misses          6        8       +2     
- Partials        8        9       +1     
Impacted Files    Coverage Δ
arq/worker.py     98.57% <80.00%> (-0.60%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9109c2e...787bb47.

rm-21 marked this pull request as ready for review on June 10, 2023, 14:05
rm-21 changed the title from "🔨 Adding a job counter to address Semaphore issues" to "🔨 Issue [#405](https://github.com/samuelcolvin/arq/issues/405): Adding a job counter to address Semaphore issues" on Jun 10, 2023
rm-21 changed the title from "🔨 Issue [#405](https://github.com/samuelcolvin/arq/issues/405): Adding a job counter to address Semaphore issues" to "🔨 Adding a job counter to address Semaphore issues" on Jun 10, 2023
@JonasKs (Sponsor Collaborator) commented on Jun 10, 2023

Thank you! 😊

It would be great if you could add a test that ensures we don't regress in the future. 😊

@rm-21 (Contributor, Author) commented on Jun 11, 2023

Could you suggest a test? I couldn't come up with an appropriate one.

@JonasKs (Sponsor Collaborator) commented on Jun 11, 2023

Set max_jobs to 1, queue a task, and ensure the health check is still logged? That should fail without this implementation, right?

I'm not at home, but I'm pretty sure there's a health check test already implemented.

@rm-21 (Contributor, Author) commented on Jun 11, 2023

I have set up a test that aims to cancel a job while the max_jobs queue is full. The test is failing for Python 3.7: the logs show that the job was enqueued and cancelled, with a stack trace for a Redis ConnectionError between the two log lines. Can you help me debug it when you have some time?

I do think just re-running the CI job for 3.7 should resolve this.
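
Roughly, the test does something like the sketch below (simplified, not the exact test in the diff; it assumes pytest-asyncio style async tests and an arq_redis fixture providing an ArqRedis pool):

import asyncio

from arq.worker import Worker


async def longish(ctx):
    await asyncio.sleep(60)


async def test_abort_job_while_max_jobs_running(arq_redis):
    # sketch only: assumes an `arq_redis` pool fixture and async test support
    job = await arq_redis.enqueue_job('longish')

    worker = Worker(
        functions=[longish],
        redis_pool=arq_redis,
        max_jobs=1,
        allow_abort_jobs=True,
        poll_delay=0.1,
    )
    worker_task = asyncio.create_task(worker.main())

    # give the worker time to start the job so the only job slot is occupied
    await asyncio.sleep(0.5)

    # before this change the poll loop blocked on the semaphore while the slot
    # was full, so the abort was never picked up; with the job counter it is
    # processed on the next poll iteration
    assert await job.abort(timeout=5)

    worker_task.cancel()
    await worker.close()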

@JonasKs (Sponsor Collaborator) left a comment

Thank you 😊
This looks good to me.

Samuel is busy with v2 of Pydantic, so I wouldn't expect this to be merged until that happens, but you can always use your own fork in the meantime 😊


if self.job_counter >= self.max_jobs:
    self.sem.release()
    return None
@JonasKs (Sponsor Collaborator) commented on the snippet above:

Just return here.

@samuelcolvin (Owner) left a comment

Sorry for the slow reply.

From a brief review, LGTM.

@JonasKs merge if you're happy.

@rm-21 (Contributor, Author) commented on Oct 30, 2023

Hey @JonasKs! Any update on when this can be merged?

JonasKs merged commit ab2dda2 into samuelcolvin:main on Oct 30, 2023
10 checks passed
@JonasKs (Sponsor Collaborator) commented on Oct 30, 2023

Sorry this took some time; I've been a bit busy lately.

As for the release, I'm gonna have to put that back into @samuelcolvin's hands - not sure what he has planned. 😊

@samuelcolvin (Owner) commented

v0.26.0b1 is released, please try it. I'll release v0.26 at the end of the week, see #441.
