Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC 72: Background workers #72

Closed

Conversation

RealOrangeOne
Copy link
Member

@RealOrangeOne RealOrangeOne commented Sep 17, 2021

@RealOrangeOne RealOrangeOne changed the title RFC: Background workers RFC 72: Background workers Sep 17, 2021
@zerolab zerolab added the 1:New label Sep 17, 2021
Copy link
Member

@lb- lb- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very interesting, nice work! I have put through some misc feedback, sorry if you already addressed some items and I missed it.

Similarly to Django's caching framework, a global "background" object can be imported, which is used to add tasks. This will be a "Huey" instance, potentially subclassed with some Wagtail sugar. This global will be pre-configured based on the application's settings such as backend, immediate mode, and connection details.

```python
from wagtail.contrib.tasks import background
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be a bit early in the process to think about naming, but maybe we should add a section for proposed names.

I think task is already taken up in the context of Workflow/Tasks.

Maybe; job or worker

But I realise Huey uses Task in its terminology so maybe unavoidable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the name here is definitely just a WIP one to fill a gap.

"worker" feels like it's the thing which performs the task, as opposed to the task itself.

"job" is definitely a common enough term that it could be fine. Agree the conflict with workflows isn't ideal. We could be more explicit and call them "background tasks" or alike?

- Is this _contrib_?
- Should Huey be an optional dependency (thus requiring us to implement "immediate mode" ourselves) or a required dependency.
- The background runner should probably have a name of some sort (and be consistent around terminology eg tasks)
- Should Django ORM support be a day-1 feature? (Wider support base, but delays release)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional questions

  • Should we integrate with audit logs or reporting
  • Should there be some kind of UI built to review status/logs from these tasks
  • What other Wagtail (or Django) packages exist
  • Should a more generic problem be looked at (message queue/publish / subscribe etc), for example being able to request a webhook (or set of webhooks) when certain actions are made, probably out of scope but just a thought
  • is it worth preparing a Wagtail package that implements this RFC (as much as possible), in a similar process to https://github.com/wagtail/wagtail-generic-chooser (although, that is not an RFC), this way it can get some practical usage and feedback before going into core... or maybe an isolated package will be good enough

Some relevant links

General feedback

  • I actually thought this week about a Workflow Task that does a background job (e.g. cliche SEO check or maybe runs Google lighthouse on the draft published page), this RFC would be a great candidate for this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be some kind of UI built to review status/logs from these tasks?

I suspect not. These tasks are really just an implementation detail, as opposed to anything someone may want to monitor. With that said, I could definitely see a use for a list of scheduled tasks, giving admins the ability to trigger them manually if they wish?

What other Wagtail (or Django) packages exist

See "Why Huey?" section

Should a more generic problem be looked at (message queue/publish / subscribe etc), for example being able to request a webhook (or set of webhooks) when certain actions are made, probably out of scope but just a thought

I think they can be the same thing. Doing external network requests in the request/response cycle can hurt performance in many cases, so if the webhook doesn't respond immediately it'll slow everything down. Given that, webhooks could definitely be an interesting (future) extension, and probably warrants having background workers regardless.

is it worth preparing a Wagtail package that implements this RFC

In an ideal world, yes, although some of the benefits of background workers are in moving parts of wagtail-core into the background, which can't be done with an external package. With that said, integration with things like signals and hooks could definitely be done externally.

One of the main benefits of this solution though is that it's built in. If it's an external package, it just becomes another competing standard.

Copy link
Member

@tomusher tomusher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @RealOrangeOne!


By default, Huey has no integrations with Django, allowing it to work in isolation, reducing any potential impact on people's existing sites. Huey does ship with an [optional Django integration](https://huey.readthedocs.io/en/latest/contrib.html#django), but it makes several assumptions about how Huey is used, and would conflict with users already using Huey or with any extended customization we may need.

Whilst this proposal also covers scheduled tasks for Wagtail, enabling those should be opted in separately to task scheduling.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little unclear on the scheduled tasks part of this. Would we replace crons (such as publish_scheduled_pages) as Huey periodic tasks, or would they remain as crons?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My imagining was both. They'd remain as management commands exactly as they are now, but there'd be the option of triggering them as scheduled tasks through a setting in settings.py. I've just realised I've not fleshed out what that API would look like yet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Is the main benefit for the user that they wouldn't need to set up separate scheduled tasks if they're already running a worker, or are there other benefits you're considering?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partly that - may save on infrastructure costs. But it also adds the ability for installed packages to "register" scheduled tasks which just get run automatically, without the user having to do anything, or even know / remember they're there.

text/072-background-workers.md Show resolved Hide resolved
text/072-background-workers.md Show resolved Hide resolved

Using this object, tasks are submitted using the existing Huey constructs and patterns.

For Wagtail [hooks](https://docs.wagtail.io/en/stable/reference/hooks.html), there will be an additional property passed when registering the hook, which will transparently convert the hook to a task, and ensure it's submitted as a task when the hook should be called. Only certain hooks will support background tasks. Others, such as those for registering URLs or menu items must be run synchronously, and so will ignore the background argument. This will only be applicable for certain hooks, and will do nothing when passed to these hooks.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps it is the responsibility of the caller of the hook to decide whether it should be run asynchronously?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There may be cases with some hooks where the user wants it to run synchronously, which I've intentionally not ruled out. It will have to be down to the caller of the hook as to whether it's possible for it to be run async, though.

@hazho
Copy link

hazho commented Sep 27, 2021

Great Work Guys, KEEP GOING PLEASE
One NOTE:
make it an optional dependency, because currently many developers don't need all features of wagtail that may need background workers or any automated tasks

@RealOrangeOne
Copy link
Member Author

make it an optional dependency

Completely agree. Wagtail can't be assuming anything on how people are setup. Some hosting environments don't have the ability to run background tasks like this, and we don't want to limit where we can run.

because currently many developers don't need all features of wagtail that may need background workers or any automated tasks

This may be true for the things which are currently run in the background, but as this integration develops, things which currently block web requests will be moved over too, possibly allowing more features if we know they're not time sensitive. It's entirely possible some people won't need the background workers, but I highly doubt that all of those people wouldn't see a benefit from deploying them. Whether that deployment is worthwhile to them is obviously 100% subjective and not something we can assume, again hence making it both optional and opt-in.

@RealOrangeOne
Copy link
Member Author

I've re-written the implementation chunk of this RFC, specifically to change to more of a backend-based approach rather than having Wagtail implement a lot of it itself or depend fully on Huey. This makes the implementation far more generic and powerful, and reduces the amount of complexity added into Wagtail itself.

Thanks to @tomusher for starting the thoughts around this new direction and initial sketches.

@vsalvino
Copy link

Great work writing this up... and I agree this would be very beneficial to have in Wagtail. While the lowest friction option is to use an existing task scheduler system (Huey, Celery, etc.) those are going to have a couple design challenges that will make this feature harder to adopt.

Data storage: Huey supports SQLite and Redis, celery supports Redis. It would be really nice if the datastore could be the same as the wagtail database to avoid adding another layer of complexity. That means task queuing would ideally be backed by Django models. The database is already the source of truth... there is no reason to require yet another service in the mix.

Task runner process: The hosting provider will most likely have to run a second process (aside from WSGI/server) for the task scheduler. It would be really beneficial if whatever task scheduler backends provided by wagtail have a uniform way of spawning this process from Python, similar to how WSGI works (e.g. Django has a wsgi.py file which can be run according to a standard interface). Huey-django uses a management command to spawn this process.

Management of this process becomes more complicated with distributed setups (e.g. people running wagtail via docker images, multi-server setups, etc.). Because no one backend would be in charge of running the process, so you'd want a dedicated process server. I'm not saying we need to solve for that scenario, just provide official guidance on how to handle it.

@RealOrangeOne
Copy link
Member Author

It would be really nice if the datastore could be the same as the wagtail database

The ORM backend will yes use the same database as Wagtail is using to avoid complexity.

The hosting provider will most likely have to run a second process

The original Huey implementation of this RFC allowed both a long running process, and a cron-based "run everything then terminate" style executor. Whilst it's down to whatever task runner a user runs as to how they want to run the actual workers, I think yes the ORM backend in Wagtail should have the ability to run under the cron-style runner so it doesn't depend on a dedicated process. In theory it'd be possible to do things in the same process, but that'd depend strictly on ASGI.

Management of this process becomes more complicated with distributed setups
It does, but it's not difficult. If you are running the application inside Docker, then just changing the entrypoint to run say the worker rather than say gunicorn. Making sure the cron jobs or background process runs becomes out of scope for Wagtail to define, so we can be as versatile as possible, however these 2 methods are very standard, and anyone deploying an application which requires background processes will likely already be familiar with running with them.

Not to today... Just forgot to update it before.
@RealOrangeOne
Copy link
Member Author

Discussion thread -> wagtail/wagtail#7750

(Specific comments on PR still more than welcome!)

Cron-style tasks are much more complex if the job queue doesn't natively support them (which many don't). It's also often much more efficient to handle this scheduling externally.
@lb-
Copy link
Member

lb- commented Sep 30, 2022

Saw this Python library pop up and thought it would be a good link to go here https://rocketry.readthedocs.io/en/stable/

Not saying we should use it but it has a nice API and may serve as some inspiration for parts of this RFC.

@fabienheureux
Copy link
Member

I discovered this one as well that is based on Postgres https://procrastinate.readthedocs.io/en/stable/discussions.html#how-does-this-all-work

No additionnel dependencies needed, only Postgres

@thibaudcolas
Copy link
Member

👋 for anyone interested in this, just wanted to note we’ve provisionally scheduled this on the Wagtail roadmap for version 6.1* (May 2024 release, see RFC 91). I believe @gasman said he was keen to lead on this.

@Tobi-De
Copy link

Tobi-De commented Feb 5, 2024

Hello, everyone! Just chiming in with a package suggestion. There is django-q2, which integrates seamlessly with Django and supports multiple backends, including the Django ORM, Redis, etc.

@carltongibson
Copy link

Hi all.

There's a massive appetite for a solution like this in Django itself.

As per a fedi-thread, I'd like to suggest making this RFC into a DEP and pushing on getting it into Django.

From the thread:

Only issue I could see is we're due to work on this very soon.

To which...

Suggestion: Put it in a third party package (maybe with an import shim in Wagtail if you want it easy to install or automatically installed) and lets begin the Merge to Django conversation. Targeting 5.2 or 6.0 gives time for bugs to show, and folks on 4.2+ can use the external package now. That’s the way.

My view is that Django should have had a story here a while back. It's something that should be part of the core framework. That you're putting in the effort here, it's time. We should point that at core, and then make it happen.

I'm happy to help with a DEP and play the shepherd role there if that's of help.

(Also, ref @Tobi-De's comment, Django-Q(2) has a lot of good bits to take inspiration from.)

@RealOrangeOne
Copy link
Member Author

@carltongibson thanks! ❤️ Expect a draft DEP from me in the coming days.

We discussed this in a Wagtail Core Team meeting this morning, and agree it makes sense to create the initial draft package externally to both Wagtail and Django, to allow Wagtail to use it sooner, with the intention to upstream it into Django in parallel.

@carltongibson
Copy link

@RealOrangeOne AWESOME! 🤩

@Tobi-De
Copy link

Tobi-De commented Feb 7, 2024

@carltongibson thanks! ❤️ Expect a draft DEP from me in the coming days.

We discussed this in a Wagtail Core Team meeting this morning, and agree it makes sense to create the initial draft package externally to both Wagtail and Django, to allow Wagtail to use it sooner, with the intention to upstream it into Django in parallel.

Simply magnificent, the best thing I've read this morning.

@kituuu
Copy link

kituuu commented Mar 5, 2024

Hey, do you know if this is fully implemented? Ig this was one of the past GSOC ideas. If it's not, can it be done in GSOC 2024?

@salty-ivy
Copy link
Member

salty-ivy commented Mar 5, 2024

Hey, do you know if this is fully implemented? Ig this was one of the past GSOC ideas. If it's not, can it be done in GSOC 2024?

This idea has moved to django itself so this framework is going to be developed and shipped as part of django once it's rfc planning is complete and merged

@RealOrangeOne
Copy link
Member Author

The core implementation of this is now a DEP: django/deps#86.

It's likely Wagtail will be one of the first deployed uses of it, but for now the proposal process and discussion has moved there.

@thibaudcolas
Copy link
Member

🥳 Django Enhancement Proposal 14: Background Workers | Django Weblog amazing work @RealOrangeOne!

@thibaudcolas
Copy link
Member

👋 I’m going to close this now, as the RFC has been replaced by a DEP. There’ll be follow-up work needed in Wagtail to leverage those new capabilities but this either won’t require an RFC, or would require one that’s more specific to the use case.

@laymonage
Copy link
Member

Just to add, a draft PR has been made for a POC of turning Wagtail's mechanisms into tasks for django-tasks: wagtail/wagtail#12040

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet