ScanJobs

Previous Chapter: Monitors

Introduction

When you run a workflow on ScrapyCloud you end up with a large number of finished jobs, each one carrying scrapy stats, log lines, items and the spider arguments it was launched with. Often you want to look across many of those jobs at once. How did the response error rate of a downloader evolve over the last two weeks? How many items did a spider scrape per day? Which jobs logged a particular error, and what number did that error report?

scanjobs is the tool for that. It is a small script, based on shub_workflow.utils.scanjobs.ScanJobs, that scans the jobs of a target spider or script over a period of time, extracts data from their stats, logs, items or spider arguments using regular expressions, optionally post-processes the extracted numbers with a tiny stack language, and optionally renders a plot from them.

As with every shub-workflow script, you subclass it once per project so that it knows where to find your jobs and so you can predefine reusable command lines (see Programs):

# myproject/scripts/scanjobs.py
from shub_workflow.utils.scanjobs import ScanJobs as ShubScanJobs


class ScanJobs(ShubScanJobs):
    PROGRAMS = {}


if __name__ == "__main__":
    import logging
    from shub_workflow.utils import get_kumo_loglevel

    logging.basicConfig(format="%(asctime)s %(name)s [%(levelname)s]: %(message)s", level=get_kumo_loglevel())
    ScanJobs().run()

The full implementation lives in shub_workflow/utils/scanjobs.py, whose module docstring and the post_process doctests are the authoritative reference for the behavior described below.

💡 Claude assistance. shub-workflow ships a Claude Code skill, scanjobs-programs, that teaches Claude how to author, edit and run scanjobs and its programs (including the postscript and plotting languages described here). Install it into your personal Claude scope as described in skills/README.md and Claude will be able to assist you with everything in this document.

Basic usage

The single positional argument is the target: a spider name, a script name (in the py:myscript.py form), or * to match everything. As with any shub-workflow script run outside of ScrapyCloud, you must pass --project-id so the tool knows which project to read jobs from.

The simplest invocation searches a log pattern in the jobs of a script and prints every match:

python scanjobs.py --project-id=<PROJECT_ID> py:deliver.py -l 'delivered .+? items'

By default the tool scans the last day of finished jobs, prints each match to the console and pauses until you press Enter before continuing to the next one. This interactive mode is meant for visual inspection; it is switched off automatically as soon as you ask for bulk output (see Output modes).

If your regular expression contains capturing groups, the captured groups become the data extracted from each match — that is what gets post-processed, written and plotted:

python scanjobs.py --project-id=<PROJECT_ID> py:deliver.py -l 'delivered (\d+) items'
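To make the role of capturing groups concrete, here is a minimal sketch using Python's standard re module. It is not the tool's own code: the log messages are invented, and the use of re.search (rather than another matching mode) is an assumption made for illustration.

```python
import re

# Hypothetical log messages; real ones come from the scanned jobs.
log_lines = [
    "delivered 120 items",
    "nothing to deliver",
    "delivered 7 items",
]

pattern = re.compile(r"delivered (\d+) items")


def extract_groups(lines, pattern):
    """Return one tuple of captured groups per matching line."""
    results = []
    for line in lines:
        match = pattern.search(line)
        if match:
            # groups() holds only the captured parts, as strings.
            results.append(match.groups())
    return results


extracted = extract_groups(log_lines, pattern)
print(extracted)  # [('120',), ('7',)]
```

Note that the captured values are strings; any numeric interpretation happens later, during post processing.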

What you can extract

There are four sources of data, each selected with its own flag. All of them are repeatable — pass the flag several times to match several things — and all of them take a regular expression.

Stats — -s / --stat-pattern. The pattern is matched against the stat keys. For every matching key, the tool emits the regex groups plus the stat's value. For example -s 'downloader/response_status_count/(5\d\d)' extracts, for each matching stat, the status code (the group) and how many such responses were seen (the value).
Log lines — -l / --log-pattern. The pattern is matched against each log message and the regex groups are emitted. Add --count to emit a 1 per match instead, which is handy for counting occurrences (see Plotting).
Items — -i / --item-field-pattern, in the form <jmespath>:<regex>. The JMESPath expression locates a field inside each item, and the regex is matched against its value. If the regex part is empty, the tool simply matches the existence of the field.
Spider arguments — -a / --spider-argument-pattern, in the form <arg>:<regex>. This one is special: it does not produce data. Instead it restricts the scan to those jobs whose given spider argument matches the pattern. This lets you focus, say, on the jobs of a generic spider that were launched for a particular source: -a source:mysource.
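The stats extraction described above can be pictured with a short sketch. It assumes a job's stats are available as a plain dict and that the pattern is applied with re.search; the stats dict and the helper name extract_stats are hypothetical, not part of shub-workflow.

```python
import re

# Hypothetical stats of a single job.
stats = {
    "downloader/response_status_count/200": 950,
    "downloader/response_status_count/500": 3,
    "downloader/response_status_count/503": 12,
    "item_scraped_count": 940,
}


def extract_stats(stats, pattern):
    """For every stat key matching the pattern, emit (groups..., value)."""
    regex = re.compile(pattern)
    rows = []
    for key, value in stats.items():
        match = regex.search(key)
        if match:
            rows.append(match.groups() + (value,))
    return rows


rows = extract_stats(stats, r"downloader/response_status_count/(5\d\d)")
print(rows)  # [('500', 3), ('503', 12)]
```

Each row pairs the captured status code with the stat's value, which is the alternating text/value shape that later sections rely on.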

You can additionally restrict which jobs are scanned with --tag-pattern / --has-tag (only jobs carrying a tag), --include-running-jobs (by default only finished jobs are scanned), and --spiders-only / --scripts-only (meaningful only when the target is *).

The time window

By default the scan covers the last 86400 seconds (1 day), ending now. Both ends are configurable:

-p / --period — the length of the window. Accepts a number of seconds, or any string understood by the timelength library, such as "10 days" or "4h".
-e / --end-time — the end of the window (defaults to now). Accepts anything dateparser recognizes, e.g. "2024-01-31" or "yesterday 18:00".

So --period "7 days" --end-time "2024-02-01" scans the week ending on the 1st of February.
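The arithmetic behind that window is simple, and can be sketched with the standard library alone. The sketch below deliberately skips the parsing step (the tool delegates that to timelength and dateparser) and starts from an already parsed number of seconds and end datetime; the helper name scan_window is hypothetical.

```python
from datetime import datetime, timedelta


def scan_window(period_seconds, end_time):
    """Return the (start, end) datetimes of a window of the given length."""
    start = end_time - timedelta(seconds=period_seconds)
    return start, end_time


# Equivalent of --period "7 days" --end-time "2024-02-01", once parsed.
start, end = scan_window(7 * 86400, datetime(2024, 2, 1))
print(start, end)  # 2024-01-25 00:00:00 2024-02-01 00:00:00
```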

Output modes

How the extracted data is delivered depends on three mutually-influencing options:

Interactive (default) — each match is printed and the scan pauses until you press Enter. Good for exploring. Use --first-match-only to stop after the first match in each job, and --max-items-per-job to cap how many items/log lines are inspected per job.
Write to a file — -w / --write <file> writes one JSON record per match to a JSON-lines file, with timestamps, and does not pause. This is the mode for collecting a large dataset for further analysis, or to be replayed later.
Plot — --plot renders a chart from the data (see below) and does not pause.

--write, --plot and --generate-items-sample all imply non-interactive scanning; you can also force it explicitly with --no-user-enter.

Data captured with -w can be replayed without re-scanning the jobs by passing it back with -r / --read <file> — useful to tweak a plot without paying for another scan.
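For readers unfamiliar with the JSON-lines layout, the following sketch shows the write-then-replay idea in general terms: one JSON object per line, read back line by line. The record fields used here (tstamp, data) are assumptions for illustration only; inspect a file produced by -w to see the actual record shape.

```python
import io
import json

# Hypothetical records; the real field names may differ.
records = [
    {"tstamp": "2024-01-30 10:00:00", "data": ["120"]},
    {"tstamp": "2024-01-31 10:00:00", "data": ["7"]},
]


def write_jsonl(records, fp):
    """Write one JSON document per line."""
    for record in records:
        fp.write(json.dumps(record) + "\n")


def read_jsonl(fp):
    """Read back every non-empty line as a JSON document."""
    return [json.loads(line) for line in fp if line.strip()]


# An in-memory buffer stands in for the file passed to -w and -r.
buffer = io.StringIO()
write_jsonl(records, buffer)
buffer.seek(0)
replayed = read_jsonl(buffer)
```

Because each line is an independent document, such a file can also be appended to, concatenated or filtered with ordinary line-oriented tools.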

Post processing the extracted data

The data extracted from each match is a tuple of values (regex groups, followed by stat values where applicable). Frequently the raw values are not what you want to chart — you want a ratio, a sum, a difference. The -c / --post-process-code option lets you transform that tuple with a tiny PostScript-like, stack-based language.

The values seed a stack, left to right. The instructions in the -c string are then evaluated left to right, each one mutating the stack. Whatever remains on the stack at the end is the data point that gets printed, written or plotted. Numeric strings are coerced to floats by arithmetic operations; non-numeric leading values (like a label captured by a regex group) are typically left in place and carried through to become the series label in a plot.

The available instructions are:

Arithmetic (pop the two top elements, push the result): add, sub, mul, div.
Stack manipulation: dup (duplicate the top element), pop (discard it), exch (swap the top two), count (push the current stack size), and roll — which pops two numbers x y and rotates the top x elements of the stack by one position, to the right if y = 1 or to the left if y = -1 (the PostScript roll operator).
Flow: <n> { ... } repeat runs the block n times.
Conversion: cvi converts the top element to an integer.
Specials: prune <n> keeps only the top n elements and discards the rest; hold preserves the stack across separate extractions within the same job (for instance when you combine a log extraction and a stat extraction and want to operate on both together).

A worked example. Suppose a stat pattern produced this tuple:

('datacenter', '11558', '2500', '9059')

and we apply -c "3 -1 roll pop exch div":

3 -1 roll rotates the top three elements one place to the left → ('datacenter', '2500', '9059', '11558')
pop discards the top → ('datacenter', '2500', '9059')
exch div swaps the last two and divides → ('datacenter', 3.6236)

The first element, a label, rode through untouched, while the three numbers were reduced to a single ratio. The full set of doctested examples lives in the post_process function of shub_workflow.utils.scanjobs and is the authoritative reference.
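The worked example can be reproduced with a toy interpreter. This is a minimal sketch written for this page, not the library's post_process function: it covers only the arithmetic operators plus dup, pop, exch and roll, treats any other token as a literal pushed onto the stack, and omits repeat, cvi, count, prune and hold.

```python
def run(values, code):
    """Evaluate a small subset of the stack language over a tuple of values."""
    stack = list(values)
    for token in code.split():
        if token in ("add", "sub", "mul", "div"):
            # Pop the two top elements, coercing numeric strings to floats.
            b = float(stack.pop())
            a = float(stack.pop())
            if token == "add":
                stack.append(a + b)
            elif token == "sub":
                stack.append(a - b)
            elif token == "mul":
                stack.append(a * b)
            else:
                stack.append(a / b)
        elif token == "dup":
            stack.append(stack[-1])
        elif token == "pop":
            stack.pop()
        elif token == "exch":
            stack[-1], stack[-2] = stack[-2], stack[-1]
        elif token == "roll":
            y = int(stack.pop())
            x = int(stack.pop())
            top = stack[-x:]
            del stack[-x:]
            if y == 1:      # rotate one position to the right
                top = top[-1:] + top[:-1]
            elif y == -1:   # rotate one position to the left
                top = top[1:] + top[:1]
            stack.extend(top)
        else:
            # Anything else is an operand, e.g. the "3" and "-1" before roll.
            stack.append(token)
    return tuple(stack)


result = run(("datacenter", "11558", "2500", "9059"), "3 -1 roll pop exch div")
print(result)  # ('datacenter', 3.6236)
```

Tracing the stack by hand against this sketch is a quick way to check a -c expression before running a long scan.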

Tip: when a -c expression depends on a fixed number of stats but some jobs are missing one of them, declare that stat with -d / --safe-default-stat. Missing occurrences then default to 0, keeping the tuple arity stable so your roll/div indices don't drift between jobs.
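The effect of a safe default on tuple arity can be illustrated as follows. This sketch looks stats up by exact key for simplicity, whereas the tool matches keys by regular expression; the helper stat_values and the stat names are hypothetical.

```python
def stat_values(stats, keys, safe_defaults=()):
    """Collect the values of the given stats, in order.

    A missing stat is skipped unless it is declared as a safe default,
    in which case 0 takes its place.
    """
    values = []
    for key in keys:
        if key in stats:
            values.append(stats[key])
        elif key in safe_defaults:
            values.append(0)
    return tuple(values)


keys = ["requests", "errors"]
job_a = {"requests": 100, "errors": 4}
job_b = {"requests": 80}  # this job logged no errors at all

without_default = stat_values(job_b, keys)
with_default = stat_values(job_b, keys, safe_defaults={"errors"})
print(without_default, with_default)  # (80,) (80, 0)
```

With the default in place, job_b yields a tuple of the same length as job_a, so the same expression applies to both.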

Naming the data points

By default a data point is an anonymous tuple. The --data-headers option turns it into a named record:

--data-headers=name1,name2,... assigns the given names to the values in order.
--data-headers=auto derives the names automatically, which requires the extraction to alternate text and value (as the stats extraction naturally does: a captured label followed by its value).

Naming is what makes the data legible in the written JSON and, crucially, it is required for plotting, because the plot options refer to the data by name.
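As a rough mental model of the two naming modes, consider the sketch below. The explicit mode is a straightforward pairing of names and values. The auto mode shown here is only one plausible reading of "alternate text and value" (each text element names the value that follows it); the actual derivation is defined by the implementation. Both helper names are hypothetical.

```python
def name_explicit(values, headers):
    """Pair each value with the header at the same position."""
    return dict(zip(headers, values))


def name_auto(values):
    """Assumed auto mode: (label, value, label, value, ...) becomes a record."""
    labels = values[0::2]
    numbers = values[1::2]
    return dict(zip(labels, numbers))


explicit = name_explicit(("datacenter", 3.6236), ["pool", "error_rate"])
auto = name_auto(("500", 3, "503", 12))
print(explicit)  # {'pool': 'datacenter', 'error_rate': 3.6236}
print(auto)      # {'500': 3, '503': 12}
```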

Plotting

With --plot the tool renders a chart with pandas, seaborn and matplotlib (install them in your environment first). The option takes a single comma-separated string of key=value pairs and bare flags. --data-headers is required, and a title= is mandatory. The x-axis defaults to the match timestamp.

Keys:

token	meaning
`title=<str>`	required chart title (may contain `{var}` placeholders — see Programs)
`X=<key>`	x-axis column (default: the match timestamp, `tstamp`)
`Y=<k1/k2/...>`	y-axis column(s), separated by `/`. If omitted, all extracted headers except those used by `X`/`hue` are plotted
`hue=<key>`	column used to colour / split into series
`tile_key=<key>`	column whose distinct values become separate subplot tiles
`ylabel=<str>`	y-axis label
`xticks=<int>`	maximum number of x ticks
`smooth=<int>`	rolling-window smoothing width
`bins=<n>/<func>`	bin the x-axis into `n` bins and aggregate each with the pandas function `func` (`sum`, `mean`, `std`, `median`, …)

Bare flags (their mere presence means true):

flag	meaning
`save`	save the chart to `<uuid>.png` (also done automatically when no interactive display is available)
`no_tiles`	draw every y key on the same axes instead of separate tiles
`tdiff`	plot the difference between consecutive points instead of the raw values

Note the two different separators: commas delimit the top-level options, while / separates multiple y keys inside Y=. A couple of examples:

--data-headers=auto --plot "title=Scraped items per day,bins=30/sum,xticks=20"
--data-headers=auto --plot "Y=error_rate,hue=pool,title=Error rate by pool,smooth=5"

The first sums scraped items into 30 bins along the time axis (one bin per day when the scan window is 30 days); the second draws a smoothed error-rate line per pool.
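The two-separator syntax is easy to get wrong, so a small parser sketch may help when composing an option string. It is an illustration of the syntax described above, not the tool's parser; the function name and the timeout_rate column are hypothetical.

```python
def parse_plot_options(spec):
    """Split a --plot string into key=value options and bare flags."""
    options = {}
    for token in spec.split(","):
        key, sep, value = token.partition("=")
        if sep:
            # Only Y carries several values, separated by "/".
            options[key] = value.split("/") if key == "Y" else value
        else:
            # A bare flag: its mere presence means true.
            options[key] = True
    return options


options = parse_plot_options(
    "Y=error_rate/timeout_rate,hue=pool,title=Error rate by pool,smooth=5,no_tiles"
)
print(options["Y"])         # ['error_rate', 'timeout_rate']
print(options["no_tiles"])  # True
```

One consequence visible in the sketch: since commas delimit the options, a title containing a comma would be split into two tokens.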

Programs: reusable command line shortcuts

A useful scanjobs command line tends to be long. To avoid retyping it (and to share it with your team) you can predefine it as a program in the PROGRAMS dictionary of your ScanJobs subclass, and then invoke it by alias with -g / --program.

Each entry has a human-readable description and a command_line, which is a list of argv tokens — a flag and its value are two separate items, never one string:

class ScanJobs(ShubScanJobs):
    PROGRAMS = {
        "response_profile": {
            "description": "Response status profile of a spider. Use with -v spider:<spider>",
            "command_line": [
                "--project-id=<PROJECT_ID>",
                "{spider}",
                "-s", "downloader/response_status_count/(\\d+)",
                "--data-headers=auto",
                "--plot", "title={spider} response profile,no_tiles",
                "--period", "7 days",
                "-z", "UTC",
            ],
        },
    }

Run it with:

python scanjobs.py -g response_profile -v spider:myspider

The -v / --program-variables option supplies values in the key1:val1,key2:val2 format. Each token of the program's command_line is run through Python's str.format(**variables), so any {placeholder} is filled in — above, both the target and the plot title get the spider name.
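The expansion step can be sketched in a few lines. The str.format call is what the text above describes; the parsing of the key1:val1,key2:val2 string and both helper names are assumptions made for illustration.

```python
def parse_variables(spec):
    """Turn 'key1:val1,key2:val2' into a dict."""
    variables = {}
    for pair in spec.split(","):
        key, _, value = pair.partition(":")
        variables[key] = value
    return variables


def expand(command_line, variables):
    """Run every argv token through str.format with the given variables."""
    return [token.format(**variables) for token in command_line]


command_line = [
    "{spider}",
    "-s", "downloader/response_status_count/(\\d+)",
    "--plot", "title={spider} response profile,no_tiles",
]

variables = parse_variables("spider:myspider")
expanded = expand(command_line, variables)
print(expanded[0])   # myspider
print(expanded[-1])  # title=myspider response profile,no_tiles
```

Tokens without placeholders, such as the stat pattern, pass through unchanged.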

There are two subtleties worth internalizing:

⚠️ Literal braces must be doubled. Because every token is passed through str.format(), any real { or } that belongs to a regular expression or to a repeat block — i.e. that is not a {placeholder} — must be escaped by doubling it: {{ and }}. For example a regex matching a trailing } is written ...(\\d+)}}, and a postscript repeat block is written "{{ add }}". Forgetting this raises a KeyError or ValueError when the program is expanded.
Explicit flags override the program. Any option you also pass on the command line replaces the one baked into the program (as long as the program left it at its default for that option). So you can reuse a program but, say, widen its window with an extra -p "30 days", or even re-point a program that hard-codes --project-id at a different project.
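The brace-doubling rule follows directly from how Python's str.format treats braces, which can be checked in isolation. The tokens below are made-up examples and failure_kind is a hypothetical helper.

```python
def failure_kind(token):
    """Return the name of the exception raised by formatting, or None."""
    try:
        token.format()
    except (KeyError, ValueError) as exc:
        return type(exc).__name__
    return None


# Doubled braces survive formatting as single literal braces.
escaped = "2 {{ add }} repeat".format()
print(escaped)  # 2 { add } repeat

# A single-braced repeat block is read as a placeholder named " add ".
print(failure_kind("2 { add } repeat"))  # KeyError

# A lone closing brace in a regex is rejected outright.
print(failure_kind("count/(\\d+)}"))     # ValueError
```

Running a program's tokens through format() like this, before adding them to PROGRAMS, catches unescaped braces early.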

To discover the programs defined in a project, read its PROGRAMS dict, or run with an unknown alias (e.g. -g ?): the tool prints * <alias>: <description> for each one.

Full options reference

flag	meaning
`<spider>` (positional)	target spider or script (`py:script.py`); `*` matches everything
`-s`, `--stat-pattern`	regex on stat keys; emits groups + the stat value (repeatable)
`-l`, `--log-pattern`	regex on log lines; emits groups (repeatable)
`-i`, `--item-field-pattern`	`jmespath:regex` on items; empty regex matches existence (repeatable)
`-a`, `--spider-argument-pattern`	`arg:regex`; filters the scan to matching jobs (repeatable)
`--tag-pattern`	only scan jobs whose tag matches (repeatable)
`--has-tag`	only scan jobs carrying the given tag (repeatable)
`--include-running-jobs`	also scan running jobs (default: finished only)
`--spiders-only` / `--scripts-only`	with target `*`, restrict to spiders / scripts
`-p`, `--period`	window length: seconds or a `timelength` string (default `86400`)
`-e`, `--end-time`	window end, any `dateparser` string (default: now)
`--first-match-only`	stop a job after its first match
`--max-items-per-job`	cap items/log lines scanned per job
`--count`	emit `1` per log match, for counting
`-c`, `--post-process-code`	postscript expression over the extracted tuple
`-d`, `--safe-default-stat`	a stat that defaults to `0` when absent (repeatable)
`--data-headers`	name the data points; comma list or `auto` (required by `--plot`)
`--separate-matches-per-target`	yield each match as its own data point
`--separate-patterns-per-target`	yield matches grouped per pattern per target
`--capture-spiderargs` / `--capture-joblink`	with `--data-headers`, also store job spider args / job URL
`--plot`	render a plot (see Plotting); requires `--data-headers`
`-w`, `--write <file>`	write captured data points to a JSON-lines file
`-r`, `--read <file>`	plot/process from a previously written file instead of scanning
`--generate-items-sample <n>`	with `-i`, reservoir-sample up to `n` matched items to a `.jl.gz`
`--tstamp-format`	output timestamp format (default `%Y-%m-%d %H:%M:%S`)
`-z`, `--zone-info`	force output timestamps into a `zoneinfo` zone, e.g. `UTC`
`--no-user-enter`	don't pause on each match (implied by `--write` / `--plot`)
`--print-progress-each`	progress log cadence in jobs (default `100`)
`--project-id`	ScrapyCloud project to read jobs from (required outside ScrapyCloud)
`-g`, `--program`	run a predefined `PROGRAMS` alias
`-v`, `--program-variables`	`key1:val1,...` placeholder values for the program
