Appendix C: Workflow Manager

Previous Chapter: Appendix B: Script Classes

Introduction

A workflow manager is a long-running script that schedules and supervises a set of related jobs on Scrapy Cloud, tracking the ones it owns across restarts. WorkFlowManager (shub_workflow/base.py) is the base class for that role: a BaseLoopScript (see Appendix B: Script Classes) extended with a name, a flow id, owned-jobs tracking, and resume machinery.

You don't usually subclass WorkFlowManager directly; you subclass one of its specializations:

the crawl managers (see Appendix D: Crawl Manager Classes);
the graph manager (see the Graph Managers chapter).

This appendix documents the shared layer those classes inherit.

Class hierarchy

BaseLoopScript                       (Appendix B)
      ▲
WorkFlowManager                      name + flow id, owned-jobs tracking, resume, failed outcomes
      ▲
      ├── CrawlManager  → …          (Appendix D)
      └── GraphManager               (Graph Managers chapter)

WorkFlowManager

A manager must have a name and always runs under a flow id (flow_id_required = True). It uses the name and flow id together to recognize the jobs it owns: children are tagged FLOW_ID=… and PARENT_NAME=…, and those tags are what make resuming and get_owned_jobs() work.
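The tag-based ownership check can be pictured with a small stand-alone model. This is only a sketch: the `owned_jobs` helper and the dict-shaped job records are hypothetical stand-ins for illustration, not the library's API or its real job objects.

```python
from typing import Any, Dict, Iterable, Iterator, List

Job = Dict[str, Any]  # hypothetical stand-in for a job record


def owned_jobs(jobs: Iterable[Job], flow_id: str, parent_name: str) -> Iterator[Job]:
    """Yield only the jobs that carry both ownership tags."""
    wanted = {f"FLOW_ID={flow_id}", f"PARENT_NAME={parent_name}"}
    for job in jobs:
        if wanted.issubset(job.get("tags", [])):
            yield job


# Three jobs in one project: only the first belongs to manager "crawl" in flow "abc".
jobs: List[Job] = [
    {"key": "1/1/1", "tags": ["FLOW_ID=abc", "PARENT_NAME=crawl"]},
    {"key": "1/1/2", "tags": ["FLOW_ID=abc", "PARENT_NAME=deliver"]},
    {"key": "1/1/3", "tags": ["FLOW_ID=xyz", "PARENT_NAME=crawl"]},
]
mine = [job["key"] for job in owned_jobs(jobs, flow_id="abc", parent_name="crawl")]
print(mine)  # ['1/1/1']
```

A job with the same flow id but a different parent name, or the same parent name in a different flow, is excluded because both tags have to match.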

Class attributes

| Attribute | Default | Meaning |
| --- | --- | --- |
| `name` | `""` | required: set it here, or pass it as the first positional CLI arg. Distinct managers in one workflow must have distinct names. |
| `default_max_jobs` | `1000` | default for `--max-running-jobs`; the cap on simultaneously-running children. |
| `flow_id_required` | `True` | a flow id is mandatory (auto-generated if not supplied). |
| `acquire_all_jobs` | `False` | if `True`, acquire owned jobs regardless of flow id (use with care; pair with `dont_acquire_finished_jobs`). |
| `dont_acquire_finished_jobs` | `False` | if `True`, don't pull finished children when resuming. |
| `base_failed_outcomes` | see below | outcomes treated as failures (subclasses route these to a failure hook). |

base_failed_outcomes = ("failed", "killed by oom", "cancelled", "cancel_timeout", "memusage_exceeded", "diskusage_exceeded", "cancelled (stalled)"). At runtime they're copied into the mutable list self.failed_outcomes, which you may extend (e.g. in __init__).
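The copy-then-extend pattern described above might look like the following. `ManagerSketch` is a hypothetical stand-in that models only the failed-outcomes handling (it is not the real WorkFlowManager), and the `"banned"` outcome is invented for illustration.

```python
class ManagerSketch:
    """Stand-in for WorkFlowManager; models only the failed-outcomes copy."""

    base_failed_outcomes = (
        "failed",
        "killed by oom",
        "cancelled",
        "cancel_timeout",
        "memusage_exceeded",
        "diskusage_exceeded",
        "cancelled (stalled)",
    )

    def __init__(self) -> None:
        # The immutable class-level tuple is copied into a per-instance list.
        self.failed_outcomes = list(self.base_failed_outcomes)


class MyManager(ManagerSketch):
    def __init__(self) -> None:
        super().__init__()
        # "banned" is a hypothetical project-specific outcome.
        self.failed_outcomes.append("banned")


manager = MyManager()
print("banned" in manager.failed_outcomes)        # True
print("banned" in MyManager.base_failed_outcomes)  # False
```

Because each instance gets its own list, extending it in one manager leaves the class-level tuple and other managers untouched.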

CLI arguments added

WorkFlowManager adds a positional name argument (only when the name attribute is unset) and --max-running-jobs (default: default_max_jobs). It also inherits everything from BaseLoopScript (--loop-mode, --max-running-time) and BaseScript (--project-id, --flow-id, --children-tag, -g/-v).
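The conditional positional argument can be illustrated with plain argparse. This is a sketch of the behaviour described above, not the library's parser code: `build_parser` is a hypothetical helper, and the inherited options are left out.

```python
import argparse


def build_parser(name: str = "", default_max_jobs: int = 1000) -> argparse.ArgumentParser:
    """Hypothetical helper mirroring the two arguments a manager adds."""
    parser = argparse.ArgumentParser()
    if not name:
        # The positional is only registered when the class did not set a name.
        parser.add_argument("name", help="manager name")
    parser.add_argument("--max-running-jobs", type=int, default=default_max_jobs)
    return parser


# Manager without a name attribute: the name comes from the command line.
unnamed = build_parser().parse_args(["mymanager", "--max-running-jobs", "20"])
# Manager with a name attribute: no positional, cap falls back to the default.
named = build_parser(name="crawl").parse_args([])

print(unnamed.name, unnamed.max_running_jobs)  # mymanager 20
print(named.max_running_jobs)                  # 1000
```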

Key methods

| Method | Purpose |
| --- | --- |
| `max_running_jobs` (property) | the active cap (`--max-running-jobs`); override for a dynamic cap. |
| `get_owned_jobs(project_id=None, **kwargs)` | iterate this manager's children (by flow id + `PARENT_NAME`); `state=[...]` required. |
| `get_finished_owned_jobs(...)` | owned jobs in `finished` state. |
| `wait_for(job_keys, interval=60, timeout=inf, heartbeat=None)` | block until the given jobs stop running. |
| `resume_workflow()` / `resume_running_job_hook(job)` / `resume_finished_job_hook(job)` | re-attach to children on restart; override the hooks to rebuild state. |
| `generate_flow_id()` | how an auto flow id is produced (uuid4 by default). |
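To make the "block until the given jobs stop running" contract concrete, here is a minimal polling sketch. It is not the library's implementation: `wait_for_sketch` and the `is_running` callable are hypothetical, the `heartbeat` parameter is omitted, and returning the still-running keys on timeout is an assumption made for this example.

```python
import time
from math import inf
from typing import Callable, Iterable, Set


def wait_for_sketch(
    job_keys: Iterable[str],
    is_running: Callable[[str], bool],
    interval: float = 60.0,
    timeout: float = inf,
) -> Set[str]:
    """Poll until no job is running or the timeout expires.

    Returns the keys still running when the loop ends (assumed behaviour).
    """
    pending = set(job_keys)
    deadline = time.monotonic() + timeout
    while True:
        pending = {key for key in pending if is_running(key)}
        if not pending or time.monotonic() >= deadline:
            return pending
        time.sleep(interval)


# Fake job states: each key reports "running" for the given number of polls.
polls_left = {"1/1/1": 0, "1/1/2": 2}


def fake_is_running(key: str) -> bool:
    polls_left[key] -= 1
    return polls_left[key] >= 0


still_running = wait_for_sketch(list(polls_left), fake_is_running, interval=0)
print(still_running)  # set()
```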

The owned-jobs model

Every job a manager schedules is tagged with the manager's FLOW_ID=<id> and PARENT_NAME=<name>. get_owned_jobs() filters by exactly those tags, so a manager only ever sees its own children, even when many managers (or workflow instances) run in the same project.

On start the manager auto-resumes if it detects a previous run with the same name + flow id (or always, when acquire_all_jobs = True): it replays running children through resume_running_job_hook() and finished ones through resume_finished_job_hook(), letting a subclass rebuild its in-memory state and continue where it left off. This is why a consistent name and flow id matter, and why distinct managers in one workflow must have distinct names.
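The resume replay described above can be modelled in a few lines. `ResumeSketch` is a hypothetical stand-in, not the real class: it takes an already-filtered list of owned jobs as dicts, and the state names it dispatches on are assumptions for this example.

```python
from typing import Any, Dict, List, Set

Job = Dict[str, Any]  # hypothetical stand-in for a job record


class ResumeSketch:
    """Models only the dispatch of owned jobs to the two resume hooks."""

    def __init__(self, owned_jobs: List[Job]) -> None:
        self.owned_jobs = owned_jobs
        self.running: Set[str] = set()
        self.done: Set[str] = set()

    def resume_running_job_hook(self, job: Job) -> None:
        # A subclass would rebuild its bookkeeping for in-flight children here.
        self.running.add(job["key"])

    def resume_finished_job_hook(self, job: Job) -> None:
        # A subclass would record results of children that completed meanwhile.
        self.done.add(job["key"])

    def resume_workflow(self) -> None:
        for job in self.owned_jobs:
            if job["state"] == "running":
                self.resume_running_job_hook(job)
            elif job["state"] == "finished":
                self.resume_finished_job_hook(job)


manager = ResumeSketch(
    [
        {"key": "1/1/1", "state": "running"},
        {"key": "1/1/2", "state": "finished"},
        {"key": "1/1/3", "state": "finished"},
    ]
)
manager.resume_workflow()
print(sorted(manager.running), sorted(manager.done))
# ['1/1/1'] ['1/1/2', '1/1/3']
```

After the replay, the restarted manager holds the same picture of its children that the previous run had built up incrementally.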

CachedFinishedJobsMixin

CachedFinishedJobsMixin (base.py) is an optional mixin for managers that check the finish state of many children frequently. It caches finished owned jobs (refreshed once per loop) and serves is_finished() from that cache, avoiding a per-job API call each cycle. Mix it in front of a workflow/crawl manager when the per-cycle is_finished() calls become a bottleneck.
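The caching idea can be sketched independently of the library. `FinishedCacheSketch` and its `fetch_finished_keys` callable are hypothetical names used only for illustration; the real mixin's attributes and refresh mechanics may differ.

```python
from typing import Callable, Iterable, List, Set


class FinishedCacheSketch:
    """Caches finished job keys so is_finished() needs no per-job API call."""

    def __init__(self, fetch_finished_keys: Callable[[], Iterable[str]]) -> None:
        self._fetch = fetch_finished_keys
        self._finished: Set[str] = set()

    def refresh(self) -> None:
        # Intended to run once per loop iteration: one bulk query.
        self._finished.update(self._fetch())

    def is_finished(self, job_key: str) -> bool:
        # Served from the cache, with no remote call.
        return job_key in self._finished


api_calls: List[int] = []  # records each simulated bulk query


def fake_fetch() -> List[str]:
    api_calls.append(1)
    return ["1/1/1", "1/1/2"]


cache = FinishedCacheSketch(fake_fetch)
cache.refresh()
states = [cache.is_finished(key) for key in ("1/1/1", "1/1/2", "1/1/3")]
print(states, len(api_calls))  # [True, True, False] 1
```

Three finish checks cost a single bulk query, which is the saving the mixin aims for when a manager supervises many children.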

Next Chapter: Appendix D: Crawl Manager Classes
