Appendix C: Workflow Manager
Previous Chapter: Appendix B: Script Classes
A workflow manager is a long-running script that schedules and supervises a set of related jobs
on Scrapy Cloud, tracking the ones it owns across restarts. WorkFlowManager (shub_workflow/base.py)
is the base class for that role: it is a BaseLoopScript (see Appendix B: Script Classes) plus the
name, flow id, owned-jobs, and resume machinery.
You don't usually subclass WorkFlowManager directly; you subclass one of its specializations:
- the crawl managers — Appendix D: Crawl Manager Classes;
- the graph manager — see the Graph Managers chapter.
This appendix documents the shared layer those classes inherit.
BaseLoopScript (Appendix B)
▲
WorkFlowManager name + flow id, owned-jobs tracking, resume, failed outcomes
▲
├── CrawlManager → … (Appendix D)
└── GraphManager (Graph Managers chapter)
A manager must have a name and runs inside a flow id (flow_id_required = True). The name +
flow id are how it recognizes the jobs it owns (children are tagged FLOW_ID=… / PARENT_NAME=…),
which is what makes resuming and get_owned_jobs() work.
| Attribute | Default | Meaning |
|---|---|---|
| name | "" | required: set it here, or pass it as the first positional CLI arg. Distinct managers in one workflow must have distinct names. |
| default_max_jobs | 1000 | default for --max-running-jobs; the cap on simultaneously-running children. |
| flow_id_required | True | a flow id is mandatory (auto-generated if not supplied). |
| acquire_all_jobs | False | if True, acquire owned jobs regardless of flow id (use with care; pair with dont_acquire_finished_jobs). |
| dont_acquire_finished_jobs | False | if True, don't pull finished children when resuming. |
| base_failed_outcomes | see below | outcomes treated as failures (subclasses route these to a failure hook). |

base_failed_outcomes = ("failed", "killed by oom", "cancelled", "cancel_timeout", "memusage_exceeded", "diskusage_exceeded", "cancelled (stalled)"). At runtime they're copied into
the mutable list self.failed_outcomes, which you may extend (e.g. in __init__).
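
The copy-then-extend pattern described above can be illustrated with a small stand-alone model. This sketch does not import shub_workflow: `ManagerModel`, `StrictManagerModel`, `is_failure()`, and the extra outcome `"banned"` are hypothetical names chosen for illustration. Only the tuple contents and the attribute names `base_failed_outcomes` and `failed_outcomes` come from the text.

```python
# Stand-alone model of the behaviour described above.
# ManagerModel is a hypothetical stand-in, not shub_workflow's WorkFlowManager.
class ManagerModel:
    # Immutable class-level defaults, as listed in this appendix.
    base_failed_outcomes = (
        "failed",
        "killed by oom",
        "cancelled",
        "cancel_timeout",
        "memusage_exceeded",
        "diskusage_exceeded",
        "cancelled (stalled)",
    )

    def __init__(self):
        # Per-instance mutable copy: extending it never touches the class tuple.
        self.failed_outcomes = list(self.base_failed_outcomes)

    def is_failure(self, outcome):
        # Hypothetical helper, added only to make the effect observable.
        return outcome in self.failed_outcomes


class StrictManagerModel(ManagerModel):
    def __init__(self):
        super().__init__()
        # "banned" is an illustrative extra outcome, not one the library defines here.
        self.failed_outcomes.append("banned")
```

Because the list is built per instance, a subclass that appends to it leaves both the base tuple and other managers unaffected.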
On the command line, a manager accepts a positional name (only when the name attribute is unset)
and --max-running-jobs (which defaults to default_max_jobs), plus everything inherited from
BaseLoopScript (--loop-mode, --max-running-time) and BaseScript (--project-id, --flow-id,
--children-tag, -g/-v).
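
The rule that the positional name exists only when the class leaves name unset can be sketched with plain argparse. This is a model of the described behaviour, not the library's parser code: `build_parser()` and `resolve_options()` are hypothetical helpers, and the inherited options are omitted.

```python
import argparse


def build_parser(name="", default_max_jobs=1000):
    """Hypothetical helper modelling the two manager-level arguments."""
    parser = argparse.ArgumentParser()
    if not name:
        # The positional argument is only added when the class did not fix a name.
        parser.add_argument("name", help="manager name")
    parser.add_argument("--max-running-jobs", type=int, default=default_max_jobs)
    return parser


def resolve_options(class_name, argv, default_max_jobs=1000):
    """Return (effective name, max running jobs) for a given command line."""
    args = build_parser(class_name, default_max_jobs).parse_args(argv)
    return class_name or args.name, args.max_running_jobs
```

For example, `resolve_options("", ["crawler", "--max-running-jobs", "5"])` takes the name from the command line, while `resolve_options("fixed", [])` uses the class-level name and the default cap.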
| Method | Purpose |
|---|---|
| max_running_jobs (property) | the active cap (--max-running-jobs); override for a dynamic cap. |
| get_owned_jobs(project_id=None, **kwargs) | iterate this manager's children (by flow id + PARENT_NAME); state=[...] required. |
| get_finished_owned_jobs(...) | owned jobs in finished state. |
| wait_for(job_keys, interval=60, timeout=inf, heartbeat=None) | block until the given jobs stop running. |
| resume_workflow() / resume_running_job_hook(job) / resume_finished_job_hook(job) | re-attach to children on restart; override the hooks to rebuild state. |
| generate_flow_id() | how an auto flow id is produced (uuid4 by default). |
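
As a rough illustration of what "block until the given jobs stop running" means for wait_for, here is a stand-alone polling sketch. It is not the library's implementation: `wait_for_model()`, the injected `is_running` callable, and the injectable `sleep` are assumptions of this example, the heartbeat parameter is left out because its semantics are not described here, and returning False on timeout is a choice of the sketch.

```python
import math
import time


def wait_for_model(job_keys, is_running, interval=60, timeout=math.inf, sleep=time.sleep):
    """Poll until none of job_keys is running, or until timeout seconds have passed.

    is_running: callable(job_key) -> bool, supplied by the caller (hypothetical).
    Returns True if all jobs stopped, False if the timeout was reached first.
    """
    pending = set(job_keys)
    waited = 0
    while True:
        # Keep only the jobs that are still running on this poll.
        pending = {key for key in pending if is_running(key)}
        if not pending:
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval
```

Injecting `sleep` keeps the sketch testable without real delays; in normal use the default `time.sleep` applies.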
Every job a manager schedules is tagged with the manager's FLOW_ID=<id> and PARENT_NAME=<name>.
get_owned_jobs() filters by exactly those tags, so a manager only ever sees its own children — even
when many managers (or workflow instances) run in the same project. On start the manager auto-resumes
if it detects a previous run with the same name + flow id (or always, when acquire_all_jobs = True):
it replays running children through resume_running_job_hook() and finished ones through
resume_finished_job_hook(), letting a subclass rebuild its in-memory state and continue where it
left off. This is why a consistent name and flow id matter, and why distinct managers in one
workflow must have distinct names.
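
The tag filter and the resume replay described above can be modelled in a few lines of plain Python. This sketch does not call Scrapy Cloud or import shub_workflow: the job dictionaries (`key`, `state`, `tags`), `owned_jobs()`, and `ResumingManagerModel` are hypothetical shapes and names used for illustration. Only the tag formats and the hook names come from the text.

```python
def owned_jobs(jobs, flow_id, name, states):
    """Yield the jobs carrying both ownership tags and in one of the given states."""
    wanted = {f"FLOW_ID={flow_id}", f"PARENT_NAME={name}"}
    for job in jobs:
        if job["state"] in states and wanted <= set(job["tags"]):
            yield job


class ResumingManagerModel:
    """Hypothetical stand-in that rebuilds in-memory state from tagged jobs."""

    def __init__(self, name, flow_id):
        self.name = name
        self.flow_id = flow_id
        self.running = set()
        self.finished = set()

    def resume_running_job_hook(self, job):
        self.running.add(job["key"])

    def resume_finished_job_hook(self, job):
        self.finished.add(job["key"])

    def resume_workflow(self, jobs):
        # Replay running children first, then finished ones, through the hooks.
        for job in owned_jobs(jobs, self.flow_id, self.name, ["running"]):
            self.resume_running_job_hook(job)
        for job in owned_jobs(jobs, self.flow_id, self.name, ["finished"]):
            self.resume_finished_job_hook(job)
```

A job tagged with the same flow id but another manager's PARENT_NAME, or with the same name but another flow id, matches only one of the two tags and is skipped. That is the isolation the paragraph above relies on.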
CachedFinishedJobsMixin (base.py) is an optional mixin for managers that frequently check the
finish state of many children. It caches finished owned jobs (refreshed once per loop) and serves
is_finished() from that cache, avoiding a per-job API call each cycle. Mix it in front of a
workflow/crawl manager (that is, list it before the manager class among the base classes, so its
methods take precedence) when the per-cycle is_finished() calls become a bottleneck.
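
The saving comes from replacing one API call per child with one listing per loop. The stand-alone sketch below models that idea only: `CountingBackend`, `CachedFinishedModel`, `list_finished()`, and `refresh()` are hypothetical names, not the mixin's actual internals.

```python
class CountingBackend:
    """Hypothetical stand-in for the jobs API; counts how often it is queried."""

    def __init__(self, finished_keys):
        self.finished_keys = set(finished_keys)
        self.list_calls = 0

    def list_finished(self):
        self.list_calls += 1
        return set(self.finished_keys)


class CachedFinishedModel:
    """Serves is_finished() from a set that is refreshed once per loop."""

    def __init__(self, backend):
        self.backend = backend
        self._finished_cache = set()

    def refresh(self):
        # One listing call per loop iteration, however many children exist.
        self._finished_cache = self.backend.list_finished()

    def is_finished(self, job_key):
        # Pure in-memory lookup: no API call per job.
        return job_key in self._finished_cache
```

With this shape, checking a thousand children in one cycle costs a single backend call plus a thousand set lookups. The trade-off is that a job finishing mid-cycle is only noticed at the next refresh.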
Next Chapter: Appendix D: Crawl Manager Classes