ScanJobs
Previous Chapter: Monitors
- Introduction
- Basic usage
- What you can extract
- The time window
- Output modes
- Post processing the extracted data
- Naming the data points
- Plotting
- Programs: reusable command line shortcuts
- Full options reference
When you run a workflow on ScrapyCloud you end up with a large number of finished jobs, each one carrying Scrapy stats, log lines, items and the spider arguments it was launched with. Often you want to look across many of those jobs at once: How did the response error rate of a downloader evolve over the last two weeks? How many items did a spider scrape per day? Which jobs logged a particular error, and what number did that error report?
scanjobs is the tool for that. It is a small script, based on shub_workflow.utils.scanjobs.ScanJobs,
that scans the jobs of a target spider or script over a period of time, extracts data from their
stats, logs, items or spider arguments using regular expressions, optionally post-processes the
extracted numbers with a tiny stack language, and optionally renders a plot from them.
As with every shub-workflow script, you subclass it once per project so that it knows where to find your jobs and so you can predefine reusable command lines (see Programs):

    # myproject/scripts/scanjobs.py
    from shub_workflow.utils.scanjobs import ScanJobs as ShubScanJobs


    class ScanJobs(ShubScanJobs):
        PROGRAMS = {}


    if __name__ == "__main__":
        import logging
        from shub_workflow.utils import get_kumo_loglevel

        logging.basicConfig(format="%(asctime)s %(name)s [%(levelname)s]: %(message)s", level=get_kumo_loglevel())
        ScanJobs().run()

The full implementation lives in shub_workflow/utils/scanjobs.py, whose module docstring and the post_process doctests are the authoritative reference for the behavior described below.
💡 Claude assistance. shub-workflow ships a Claude Code skill, `scanjobs-programs`, that teaches Claude how to author, edit and run scanjobs and its programs (including the postscript and plotting languages described here). Install it into your personal Claude scope as described in `skills/README.md` and Claude will be able to assist you with everything in this document.
The single positional argument is the target: a spider name, a script name (in the
py:myscript.py form), or * to match everything. As with any shub-workflow script run outside of
ScrapyCloud, you must pass --project-id so the tool knows which project to read jobs from.
The simplest invocation searches for a log pattern in the jobs of a script and prints every match:
python scanjobs.py --project-id=<PROJECT_ID> py:deliver.py -l 'delivered .+? items'
By default the tool scans the last day of finished jobs, prints each match to the console and pauses until you press Enter before continuing to the next one. This interactive mode is meant for visual inspection. The tool switches to non-interactive scanning automatically as soon as you ask for bulk output (see Output modes).
If your regular expression contains capturing groups, the captured groups become the data extracted from each match — that is what gets post-processed, written and plotted:
python scanjobs.py --project-id=<PROJECT_ID> py:deliver.py -l 'delivered (\d+) items'
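To make the effect of capturing groups concrete, here is a minimal sketch using Python's `re` module on a made-up log message. The message text is hypothetical, and the use of `re.search` (a match anywhere in the line) is an assumption for illustration, not a statement about how the tool applies the pattern.

```python
import re

# Hypothetical log message, for illustration only.
message = "deliver: delivered 1532 items to bucket"

# Without a capturing group, the matched text is all there is to show.
plain = re.search(r"delivered .+? items", message)

# With a capturing group, the group contents are the extracted data.
grouped = re.search(r"delivered (\d+) items", message)

print(plain.group(0))    # the matched text
print(grouped.groups())  # the tuple of captured groups
```

The second pattern yields a one-element tuple holding the number as a string, which is the kind of value that later gets post-processed, written or plotted.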
There are four sources of data, each selected with its own flag. All of them are repeatable — pass the flag several times to match several things — and all of them take a regular expression.
- Stats (`-s` / `--stat-pattern`). The pattern is matched against the stat keys. For every matching key, the tool emits the regex groups plus the stat's value. For example, `-s 'downloader/response_status_count/(5\d\d)'` extracts, for each matching stat, the status code (the group) and how many such responses were seen (the value).
- Log lines (`-l` / `--log-pattern`). The pattern is matched against each log message and the regex groups are emitted. Add `--count` to emit a `1` per match instead, which is handy for counting occurrences (see Plotting).
- Items (`-i` / `--item-field-pattern`), in the form `<jmespath>:<regex>`. The JMESPath expression locates a field inside each item, and the regex is matched against its value. If the regex part is empty, the tool simply matches the existence of the field.
- Spider arguments (`-a` / `--spider-argument-pattern`), in the form `<arg>:<regex>`. This one is special: it does not produce data. Instead it restricts the scan to those jobs whose given spider argument matches the pattern. This lets you focus, say, on the jobs of a generic spider that were launched for a particular source: `-a source:mysource`.
You can additionally restrict which jobs are scanned with --tag-pattern / --has-tag (only jobs
carrying a tag), --include-running-jobs (by default only finished jobs are scanned), and
--spiders-only / --scripts-only (meaningful only when the target is *).
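The stats extraction described above (regex groups followed by the stat's value) can be sketched in a few lines of Python. The stats dictionary and the helper name `extract_stats` are hypothetical, and matching with `search` rather than a full match is an assumption. The real logic lives in shub_workflow/utils/scanjobs.py.

```python
import re

# Hypothetical job stats, for illustration only.
stats = {
    "downloader/response_status_count/200": 950,
    "downloader/response_status_count/503": 12,
    "item_scraped_count": 900,
}

pattern = re.compile(r"downloader/response_status_count/(5\d\d)")


def extract_stats(stats, pattern):
    """Return one tuple (groups..., value) per stat key matching the pattern."""
    points = []
    for key, value in stats.items():
        match = pattern.search(key)  # assumption: search, not full match
        if match:
            points.append(match.groups() + (value,))
    return points


print(extract_stats(stats, pattern))
```

Only the 5xx key matches, so the result is a single tuple holding the captured status code and its count.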
By default the scan covers the last 86400 seconds (1 day), ending now. Both ends are configurable:
- `-p` / `--period`: the length of the window. Accepts a number of seconds, or any string understood by the timelength library, such as `"10 days"` or `"4h"`.
- `-e` / `--end-time`: the end of the window (defaults to now). Accepts anything dateparser recognizes, e.g. `"2024-01-31"` or `"yesterday 18:00"`.
So --period "7 days" --end-time "2024-02-01" scans the week ending on the 1st of February.
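The window arithmetic itself is simple, as this minimal sketch with the standard `datetime` module shows. The helper name `scan_window` is hypothetical, and the sketch takes already-parsed values. String parsing (which the tool delegates to timelength and dateparser) and time zone handling are deliberately left out.

```python
from datetime import datetime, timedelta


def scan_window(period_seconds=86400, end_time=None):
    """Return (start, end) for a window of the given length ending at end_time."""
    end = end_time if end_time is not None else datetime.now()
    return end - timedelta(seconds=period_seconds), end


# Equivalent of: --period "7 days" --end-time "2024-02-01"
start, end = scan_window(period_seconds=7 * 86400, end_time=datetime(2024, 2, 1))
print(start, "->", end)
```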
How the extracted data is delivered depends on three mutually-influencing options:
- Interactive (default): each match is printed and the scan pauses until you press Enter. Good for exploring. Use `--first-match-only` to stop after the first match in each job, and `--max-items-per-job` to cap how many items/log lines are inspected per job.
- Write to a file: `-w` / `--write <file>` writes one JSON record per match to a JSON-lines file, with timestamps, and does not pause. This is the mode for collecting a large dataset for further analysis, or to be replayed later.
- Plot: `--plot` renders a chart from the data (see Plotting) and does not pause.
Both --write and --plot (and --generate-items-sample) imply non-interactive scanning; you can
also force it explicitly with --no-user-enter.
Data captured with -w can be replayed without re-scanning the jobs by passing it back with
-r / --read <file> — useful to tweak a plot without paying for another scan.
The data extracted from each match is a tuple of values (regex groups, followed by stat values where
applicable). Frequently the raw values are not what you want to chart — you want a ratio, a sum,
a difference. The -c / --post-process-code option lets you transform that tuple with a tiny
PostScript-like, stack-based language.
The values seed a stack, left to right. The instructions in the -c string are then evaluated left
to right, each one mutating the stack. Whatever remains on the stack at the end is the data point
that gets printed, written or plotted. Numeric strings are coerced to floats by arithmetic
operations; non-numeric leading values (like a label captured by a regex group) are typically left in
place and carried through to become the series label in a plot.
The available instructions are:
- Arithmetic (pop the two top elements, push the result): `add`, `sub`, `mul`, `div`.
- Stack manipulation: `dup` (duplicate the top element), `pop` (discard it), `exch` (swap the top two), `count` (push the current stack size), and `roll`, which pops two numbers `x y` and rotates the top `x` elements of the stack by one position, to the right if `y = 1` or to the left if `y = -1` (the PostScript `roll` operator).
- Flow: `<n> { ... } repeat` runs the block `n` times.
- Conversion: `cvi` converts the top element to an integer.
- Specials: `prune <n>` keeps only the top `n` elements and discards the rest; `hold` preserves the stack across separate extractions within the same job (for instance when you combine a log extraction and a stat extraction and want to operate on both together).
A worked example. Suppose a stat pattern produced this tuple:
('datacenter', '11558', '2500', '9059')
and we apply -c "3 -1 roll pop exch div":
- `3 -1 roll` rotates the top three elements one place to the left → `('datacenter', '2500', '9059', '11558')`
- `pop` discards the top → `('datacenter', '2500', '9059')`
- `exch div` swaps the last two and divides → `('datacenter', 3.6236)`
The first element, a label, rode through untouched, while the three numbers were reduced to a single ratio: `11558` was discarded and `9059` was divided by `2500`. The full set of doctested examples lives in the post_process function of shub_workflow.utils.scanjobs and is the authoritative reference.
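For readers who want to trace the worked example by hand, the following is a minimal sketch of a stack evaluator covering only a subset of the instructions (`add`, `sub`, `mul`, `div`, `dup`, `pop`, `exch`, `roll` and numeric literals). It is not the library's `post_process` implementation. The function name `run_postscript` is hypothetical, and `repeat`, `count`, `cvi`, `prune` and `hold` are deliberately omitted.

```python
def run_postscript(values, code):
    """Evaluate a small subset of the stack language over the extracted values."""
    stack = list(values)  # the extracted values seed the stack, left to right
    binary = {
        "add": lambda a, b: a + b,
        "sub": lambda a, b: a - b,
        "mul": lambda a, b: a * b,
        "div": lambda a, b: a / b,
    }
    for token in code.split():
        if token in binary:
            b = float(stack.pop())  # top of the stack is the right operand
            a = float(stack.pop())
            stack.append(binary[token](a, b))
        elif token == "dup":
            stack.append(stack[-1])
        elif token == "pop":
            stack.pop()
        elif token == "exch":
            stack[-1], stack[-2] = stack[-2], stack[-1]
        elif token == "roll":
            shift = int(stack.pop())  # y: positive rotates right, negative left
            size = int(stack.pop())   # x: how many top elements take part
            top = stack[-size:]
            shift %= size
            stack[-size:] = top[-shift:] + top[:-shift]
        else:
            stack.append(float(token))  # anything else is a numeric literal
    return tuple(stack)


result = run_postscript(
    ("datacenter", "11558", "2500", "9059"), "3 -1 roll pop exch div"
)
print(result)
```

The label at the bottom of the stack is never touched because every instruction in the expression works on the top elements only.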
Tip: when a `-c` expression depends on a fixed number of stats but some jobs are missing one of them, declare that stat with `-d` / `--safe-default-stat`. Missing occurrences then default to `0`, keeping the tuple arity stable so your `roll`/`div` indices don't drift between jobs.
By default a data point is an anonymous tuple. The --data-headers option turns it into a named
record:
- `--data-headers=name1,name2,...` assigns the given names to the values in order.
- `--data-headers=auto` derives the names automatically, which requires the extraction to alternate text and value (as the stats extraction naturally does: a captured label followed by its value).
Naming is what makes the data legible in the written JSON and, crucially, it is required for plotting, because the plot options refer to the data by name.
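As a rough illustration of the two naming modes, the sketch below turns a tuple into a named record. Everything here is an assumption made for the example: the helper name `name_data_point`, the header names `pool` and `ratio`, and in particular the reading of `auto` as "each text value names the value that follows it". The actual derivation rules are defined in shub_workflow/utils/scanjobs.py.

```python
def name_data_point(values, headers):
    """Turn an anonymous tuple of values into a named record (a dict)."""
    if headers == "auto":
        # Assumption: values alternate label, value, label, value, ...
        return dict(zip(values[0::2], values[1::2]))
    # Explicit names are assigned to the values in order.
    return dict(zip(headers.split(","), values))


print(name_data_point(("503", 12, "200", 950), "auto"))
print(name_data_point(("datacenter", 3.6236), "pool,ratio"))
```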
With --plot the tool renders a chart with pandas,
seaborn and matplotlib (install them in
your environment first). The option takes a single comma-separated string of key=value pairs
and bare flags. --data-headers is required, and a title= is mandatory. The x-axis defaults to the
match timestamp.
Keys:
| token | meaning |
|---|---|
| `title=<str>` | required chart title (may contain `{var}` placeholders; see Programs) |
| `X=<key>` | x-axis column (default: the match timestamp, `tstamp`) |
| `Y=<k1/k2/...>` | y-axis column(s), separated by `/`. If omitted, defaults to all extracted headers except those used by `X`/`hue` |
| `hue=<key>` | column used to colour / split into series |
| `tile_key=<key>` | column whose distinct values become separate subplot tiles |
| `ylabel=<str>` | y-axis label |
| `xticks=<int>` | maximum number of x ticks |
| `smooth=<int>` | rolling-window smoothing width |
| `bins=<n>/<func>` | bin the x-axis into `n` bins and aggregate each with the pandas function `func` (`sum`, `mean`, `std`, `median`, …) |
Bare flags (their mere presence means true):
| flag | meaning |
|---|---|
| `save` | save the chart to `<uuid>.png` (also done automatically when no interactive display is available) |
| `no_tiles` | draw every y key on the same axes instead of separate tiles |
| `tdiff` | plot the difference between consecutive points instead of the raw values |
Note the two different separators: commas delimit the top-level options, while / separates multiple
y keys inside Y=. A couple of examples:
--data-headers=auto --plot "title=Scraped items per day,bins=30/sum,xticks=20"
--data-headers=auto --plot "Y=error_rate,hue=pool,title=Error rate by pool,smooth=5"
The first sums scraped items into 30 bins along the x-axis (one bin per day when the scanned window is 30 days); the second draws a smoothed error-rate line per pool.
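The two-separator convention can be illustrated with a naive parser for the `--plot` string. This is only a sketch of the format as described here, not the tool's own parser. The function name `parse_plot_options` is hypothetical, and values are kept as strings rather than converted.

```python
def parse_plot_options(spec):
    """Split a --plot string into key=value options and bare flags."""
    options = {}
    for token in spec.split(","):  # commas delimit the top-level options
        key, sep, value = token.partition("=")
        if not sep:
            options[key] = True  # bare flag, e.g. save or no_tiles
        elif key == "Y":
            options[key] = value.split("/")  # "/" separates multiple y keys
        else:
            options[key] = value
    return options


print(parse_plot_options("title=Scraped items per day,bins=30/sum,xticks=20"))
print(parse_plot_options("Y=error_rate,hue=pool,title=Error rate by pool,smooth=5"))
```

Note that a title containing a comma would be split by this naive approach; how the tool itself treats that case is not covered here.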
A useful scanjobs command line tends to be long. To avoid retyping it (and to share it with your
team) you can predefine it as a program in the PROGRAMS dictionary of your ScanJobs subclass,
and then invoke it by alias with -g / --program.
Each entry has a human-readable description and a command_line, which is a list of argv tokens
— a flag and its value are two separate items, never one string:

    class ScanJobs(ShubScanJobs):
        PROGRAMS = {
            "response_profile": {
                "description": "Response status profile of a spider. Use with -v spider:<spider>",
                "command_line": [
                    "--project-id=<PROJECT_ID>",
                    "{spider}",
                    "-s", "downloader/response_status_count/(\\d+)",
                    "--data-headers=auto",
                    "--plot", "title={spider} response profile,no_tiles",
                    "--period", "7 days",
                    "-z", "UTC",
                ],
            },
        }

Run it with:
python scanjobs.py -g response_profile -v spider:myspider
The -v / --program-variables option supplies values in the key1:val1,key2:val2 format. Each
token of the program's command_line is run through Python's str.format(**variables), so any
{placeholder} is filled in — above, both the target and the plot title get the spider name.
There are two subtleties worth internalizing:
- ⚠️ Literal braces must be doubled. Because every token is passed through `str.format()`, any real `{` or `}` that belongs to a regular expression or to a `repeat` block (i.e. that is not a `{placeholder}`) must be escaped by doubling it: `{{` and `}}`. For example, a regex matching a trailing `}` is written `...(\\d+)}}`, and a postscript repeat block is written `"{{ add }}"`. Forgetting this raises a `KeyError` or `ValueError` when the program is expanded.
- Explicit flags override the program. Any option you also pass on the command line replaces the one baked into the program (as long as the program left it at its default for that option). So you can reuse a program but, say, widen its window with an extra `-p "30 days"`, or even re-point a program that hard-codes `--project-id` at a different project.
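The brace rule follows directly from how `str.format()` behaves, which the sketch below demonstrates. The helper `expand_program` is a hypothetical stand-in for the expansion step, not the tool's code, and the `-c` expression is made up for the example.

```python
def expand_program(command_line, variables_spec):
    """Fill {placeholders} in each argv token from a key1:val1,key2:val2 string."""
    variables = dict(pair.split(":", 1) for pair in variables_spec.split(","))
    return [token.format(**variables) for token in command_line]


command_line = [
    "{spider}",
    "-c", "3 {{ add }} repeat",  # doubled braces survive as literal braces
    "--plot", "title={spider} response profile,no_tiles",
]
expanded = expand_program(command_line, "spider:myspider")
print(expanded)

# Undoubled braces are read as a placeholder and make the expansion fail.
try:
    expand_program(["3 { add } repeat"], "spider:myspider")
except (KeyError, ValueError) as exc:
    print("expansion failed:", type(exc).__name__)
```

After expansion the doubled braces collapse to single ones, so the postscript interpreter sees an ordinary `{ add }` block.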
To discover the programs defined in a project, read its PROGRAMS dict, or run with an unknown alias
(e.g. -g ?): the tool prints * <alias>: <description> for each one.
| flag | meaning |
|---|---|
| `<spider>` (positional) | target spider or script (`py:script.py`); `*` matches everything |
| `-s`, `--stat-pattern` | regex on stat keys; emits groups + the stat value (repeatable) |
| `-l`, `--log-pattern` | regex on log lines; emits groups (repeatable) |
| `-i`, `--item-field-pattern` | `jmespath:regex` on items; empty regex matches existence (repeatable) |
| `-a`, `--spider-argument-pattern` | `arg:regex`; filters the scan to matching jobs (repeatable) |
| `--tag-pattern` | only scan jobs whose tag matches (repeatable) |
| `--has-tag` | only scan jobs carrying the given tag (repeatable) |
| `--include-running-jobs` | also scan running jobs (default: finished only) |
| `--spiders-only` / `--scripts-only` | with target `*`, restrict to spiders / scripts |
| `-p`, `--period` | window length: seconds or a timelength string (default 86400) |
| `-e`, `--end-time` | window end, any dateparser string (default: now) |
| `--first-match-only` | stop a job after its first match |
| `--max-items-per-job` | cap items/log lines scanned per job |
| `--count` | emit 1 per log match, for counting |
| `-c`, `--post-process-code` | postscript expression over the extracted tuple |
| `-d`, `--safe-default-stat` | a stat that defaults to 0 when absent (repeatable) |
| `--data-headers` | name the data points; comma list or `auto` (required by `--plot`) |
| `--separate-matches-per-target` | yield each match as its own data point |
| `--separate-patterns-per-target` | yield matches grouped per pattern per target |
| `--capture-spiderargs` / `--capture-joblink` | with `--data-headers`, also store job spider args / job URL |
| `--plot` | render a plot (see Plotting); requires `--data-headers` |
| `-w`, `--write <file>` | write captured data points to a JSON-lines file |
| `-r`, `--read <file>` | plot/process from a previously written file instead of scanning |
| `--generate-items-sample <n>` | with `-i`, reservoir-sample up to n matched items to a `.jl.gz` |
| `--tstamp-format` | output timestamp format (default `%Y-%m-%d %H:%M:%S`) |
| `-z`, `--zone-info` | force output timestamps into a zoneinfo zone, e.g. `UTC` |
| `--no-user-enter` | don't pause on each match (implied by `--write` / `--plot`) |
| `--print-progress-each` | progress log cadence in jobs (default 100) |
| `--project-id` | ScrapyCloud project to read jobs from (required outside ScrapyCloud) |
| `-g`, `--program` | run a predefined `PROGRAMS` alias |
| `-v`, `--program-variables` | `key1:val1,...` placeholder values for the program |