
Appendix C: Workflow Manager

Martin Olveyra edited this page Jun 18, 2026 · 1 revision

Previous Chapter: Appendix B: Script Classes


Introduction

A workflow manager is a long-running script that schedules and supervises a set of related jobs on Scrapy Cloud, tracking the ones it owns across restarts. WorkFlowManager (shub_workflow/base.py) is the base class for that role — it is a BaseLoopScript (see Appendix B: Script Classes) plus the name / flow-id / owned-jobs / resume machinery.

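You don't usually subclass WorkFlowManager directly; you subclass one of its specializations:

- CrawlManager and its descendants (Appendix D: Crawl Manager Classes)
- GraphManager (Graph Managers chapter)

This appendix documents the shared layer those classes inherit.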

Class hierarchy

BaseLoopScript                       (Appendix B)
      ▲
WorkFlowManager                      name + flow id, owned-jobs tracking, resume, failed outcomes
      ▲
      ├── CrawlManager  → …          (Appendix D)
      └── GraphManager               (Graph Managers chapter)

WorkFlowManager

A manager must have a name and runs inside a flow id (flow_id_required = True). The name + flow id are how it recognizes the jobs it owns (children are tagged FLOW_ID=… / PARENT_NAME=…), which is what makes resuming and get_owned_jobs() work.

Class attributes

| Attribute | Default | Meaning |
|---|---|---|
| name | "" | Required. Set it here, or pass it as the first positional CLI arg. Distinct managers in one workflow must have distinct names. |
| default_max_jobs | 1000 | Default for --max-running-jobs; the cap on simultaneously-running children. |
| flow_id_required | True | A flow id is mandatory (auto-generated if not supplied). |
| acquire_all_jobs | False | If True, acquire owned jobs regardless of flow id (use with care; pair with dont_acquire_finished_jobs). |
| dont_acquire_finished_jobs | False | If True, don't pull finished children when resuming. |
| base_failed_outcomes | see below | Outcomes treated as failures (subclasses route these to a failure hook). |

base_failed_outcomes = ("failed", "killed by oom", "cancelled", "cancel_timeout", "memusage_exceeded", "diskusage_exceeded", "cancelled (stalled)"). At runtime they're copied into the mutable list self.failed_outcomes, which you may extend (e.g. in __init__).
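The copy-then-extend pattern described above can be sketched as follows. This is a minimal illustration that uses a stand-in class instead of the real WorkFlowManager, so it runs without shub_workflow installed; `ManagerSketch`, `MyManager` and the `"banned"` outcome are hypothetical names chosen for the example.

```python
class ManagerSketch:
    """Stand-in for WorkFlowManager; models only the failed-outcomes handling."""

    base_failed_outcomes = (
        "failed",
        "killed by oom",
        "cancelled",
        "cancel_timeout",
        "memusage_exceeded",
        "diskusage_exceeded",
        "cancelled (stalled)",
    )

    def __init__(self):
        # Copy the immutable class-level tuple into a per-instance mutable list.
        self.failed_outcomes = list(self.base_failed_outcomes)


class MyManager(ManagerSketch):
    def __init__(self):
        super().__init__()
        # Hypothetical extra outcome this manager also wants treated as a failure.
        self.failed_outcomes.append("banned")


manager = MyManager()
```

Because the extension happens on the instance list, the class-level tuple stays untouched and other managers in the same process keep the default set.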

CLI arguments added

WorkFlowManager adds a positional name argument (only when the name attribute is unset) and --max-running-jobs (default: default_max_jobs). It also inherits everything from BaseLoopScript (--loop-mode, --max-running-time) and BaseScript (--project-id, --flow-id, --children-tag, -g/-v).
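To make the conditional positional concrete, here is a small model of the two manager-level arguments. It assumes a plain argparse parser and is not the library's own parser setup; `build_parser` and its parameters are hypothetical, and the inherited BaseLoopScript / BaseScript options are left out.

```python
import argparse


def build_parser(name_attribute="", default_max_jobs=1000):
    """Model of the manager-level arguments only (assumption: plain argparse)."""
    parser = argparse.ArgumentParser()
    if not name_attribute:
        # The positional is added only when the class does not set `name`.
        parser.add_argument("name", help="Manager name.")
    parser.add_argument("--max-running-jobs", type=int, default=default_max_jobs)
    return parser


# Class leaves `name` unset: the name comes from the command line.
args = build_parser().parse_args(["mymanager", "--max-running-jobs", "20"])

# Class sets `name`: no positional is expected, and the default cap applies.
named_args = build_parser(name_attribute="mymanager").parse_args([])
```

In practice this means a manager class with a hard-coded name is invoked without the positional, while a generic one receives its name as the first argument.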

Key methods

| Method | Purpose |
|---|---|
| max_running_jobs (property) | The active cap (--max-running-jobs); override for a dynamic cap. |
| get_owned_jobs(project_id=None, **kwargs) | Iterate this manager's children (by flow id + PARENT_NAME); state=[...] is required. |
| get_finished_owned_jobs(...) | Owned jobs in finished state. |
| wait_for(job_keys, interval=60, timeout=inf, heartbeat=None) | Block until the given jobs stop running. |
| resume_workflow() / resume_running_job_hook(job) / resume_finished_job_hook(job) | Re-attach to children on restart; override the hooks to rebuild state. |
| generate_flow_id() | How an auto flow id is produced (uuid4 by default). |
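As an illustration of what "block until the given jobs stop running" involves, the following is a rough polling sketch, not the library's implementation. The `is_running` callback and the injectable `sleep` are hypothetical additions so the example runs without Scrapy Cloud, and the heartbeat parameter is deliberately omitted because its exact semantics are not described here.

```python
import math
import time


def wait_for(job_keys, is_running, interval=60, timeout=math.inf, sleep=time.sleep):
    """Poll until none of job_keys is running or `timeout` seconds have been waited.

    Returns the set of keys still running (empty when all of them stopped).
    """
    pending = set(job_keys)
    waited = 0
    while pending and waited < timeout:
        pending = {key for key in pending if is_running(key)}
        if not pending:
            break
        sleep(interval)
        waited += interval
    return pending


# Fake job states: each key reports "running" for a fixed number of polls.
polls_left = {"1/2/3": 2, "1/2/4": 0}


def fake_is_running(key):
    polls_left[key] -= 1
    return polls_left[key] >= 0


sleeps = []  # records requested sleep intervals instead of actually sleeping
still_running = wait_for(polls_left.keys(), fake_is_running, interval=1, sleep=sleeps.append)
timed_out = wait_for(["1/2/9"], lambda key: True, interval=10, timeout=25, sleep=sleeps.append)
```

The first call returns an empty set once both jobs stop; the second gives up after the timeout and returns the job that is still running.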

The owned-jobs model

Every job a manager schedules is tagged with the manager's FLOW_ID=<id> and PARENT_NAME=<name>. get_owned_jobs() filters by exactly those tags, so a manager only ever sees its own children — even when many managers (or workflow instances) run in the same project. On start the manager auto-resumes if it detects a previous run with the same name + flow id (or always, when acquire_all_jobs = True): it replays running children through resume_running_job_hook() and finished ones through resume_finished_job_hook(), letting a subclass rebuild its in-memory state and continue where it left off. This is why a consistent name and flow id matter, and why distinct managers in one workflow must have distinct names.
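The ownership filter and the resume replay can be modeled in a few lines. This is a simplified sketch over plain dictionaries, assuming tags are stored as a list of strings on each job; `owned_jobs`, `resume` and the sample job records are hypothetical and stand in for get_owned_jobs(), resume_workflow() and real Scrapy Cloud jobs.

```python
def owned_jobs(jobs, flow_id, parent_name):
    """Yield only the jobs carrying both ownership tags."""
    wanted = {f"FLOW_ID={flow_id}", f"PARENT_NAME={parent_name}"}
    for job in jobs:
        if wanted.issubset(job["tags"]):
            yield job


def resume(jobs, flow_id, parent_name, on_running, on_finished):
    """Replay owned children through the hook that matches their state."""
    for job in owned_jobs(jobs, flow_id, parent_name):
        if job["state"] == "running":
            on_running(job)
        elif job["state"] == "finished":
            on_finished(job)


project_jobs = [
    {"key": "1/2/3", "state": "running", "tags": ["FLOW_ID=abc", "PARENT_NAME=crawl"]},
    {"key": "1/2/4", "state": "finished", "tags": ["FLOW_ID=abc", "PARENT_NAME=crawl"]},
    # Same flow, different manager name: not owned by "crawl".
    {"key": "1/2/5", "state": "running", "tags": ["FLOW_ID=abc", "PARENT_NAME=deliver"]},
    # Same manager name, different workflow instance: not owned either.
    {"key": "1/2/6", "state": "running", "tags": ["FLOW_ID=xyz", "PARENT_NAME=crawl"]},
]

running, finished = [], []
resume(
    project_jobs,
    "abc",
    "crawl",
    on_running=lambda job: running.append(job["key"]),
    on_finished=lambda job: finished.append(job["key"]),
)
```

Only the first two jobs reach the hooks: the other two share one tag each with the manager but not both, which is the isolation that distinct names and flow ids provide.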

CachedFinishedJobsMixin

CachedFinishedJobsMixin (base.py) is an optional mixin for managers that check the finish state of many children frequently. It caches finished owned jobs (refreshed once per loop) and serves is_finished() from that cache, avoiding a per-job API call each cycle. Mix it in front of a workflow/crawl manager when the per-cycle is_finished() calls become a bottleneck.
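The saving comes from replacing one request per job with one request per loop. A toy model of that idea, assuming nothing about the mixin's internals: `FakeAPI`, `CachedFinishedSketch` and `refresh_cache` are hypothetical names, and the request counter exists only to make the effect visible.

```python
class FakeAPI:
    """Hypothetical stand-in for the Scrapy Cloud client; counts list requests."""

    def __init__(self, finished_keys):
        self.finished_keys = set(finished_keys)
        self.calls = 0

    def list_finished(self):
        self.calls += 1
        return set(self.finished_keys)


class CachedFinishedSketch:
    """Caches finished job keys; one refresh per loop serves every lookup."""

    def __init__(self, api):
        self.api = api
        self._finished_cache = set()

    def refresh_cache(self):
        # Called once at the start of each loop iteration.
        self._finished_cache = self.api.list_finished()

    def is_finished(self, job_key):
        # Pure in-memory lookup: no API request per job.
        return job_key in self._finished_cache


api = FakeAPI(finished_keys=["1/2/3", "1/2/4"])
manager = CachedFinishedSketch(api)
manager.refresh_cache()
states = [manager.is_finished(key) for key in ("1/2/3", "1/2/4", "1/2/5")]
```

Three lookups cost a single request here; the trade-off is that a job finishing mid-loop is only seen as finished after the next refresh.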


Next Chapter: Appendix D: Crawl Manager Classes
