v0.2.4: mss-clamp directive + reconfigure CLI + ExecReload#44
Merged
Conversation
Closes the SPEC §11.4 MSS-clamping gap: standard iptables `TCPMSS
--set-mss` rules don't fire on fast-pathed flows because XDP redirect
via bpf_redirect_map skips netfilter entirely. This was hitting real
operators on edge routers where customer subnets are in `allow-prefix`
and the MSS clamp on the WAN-egress FORWARD rule is required for TCP
to behave across MTU-limited tunnels.
mss-clamp ships as a directive inside the existing fast-path module,
not a separate module — bpf_redirect_map bypasses TC egress, and the
multi-module dispatcher isn't built yet, so a separate-program clamp
has nowhere to hook on fast-pathed traffic. Inline mutation in the
existing XDP program before redirect is the only correct approach
today and remains correct after the dispatcher lands (mss-clamp would
just be one of fast-path's policies).
Four grammars accepted, looked up in order of specificity:
mss-clamp <cidr> via <iface> <mtu> # most specific
mss-clamp <cidr> <mtu> # prefix-scoped, any egress
mss-clamp via <iface> <mtu> # egress-iface, any prefix
mss-clamp <mtu> # global default
Prefix matches src OR dst (mirrors allow-prefix); clamp applies to
SYN AND SYN-ACK (per-direction MSS is independent in TCP); lower-if-
higher policy (matches iptables `TCPMSS --set-mss` semantics).
Bundled with two related items the operator runbook ties together:
* `packetframe reconfigure` CLI subcommand. The SIGHUP plumbing in
the daemon was already wired (loader::reconfigure_from_signal,
fast-path's reconcile delta paths) — what was missing was the CLI
side. Now sends SIGHUP via PIDFile + /proc/<pid>/comm validation
and polls a `last-reconfigure.timestamp` ack marker for up to 5s.
Distinguishes parse errors from per-module reconcile failures via
exit code, so scripted operator flows can react.
* ExecReload + PIDFile in the systemd unit. Was deliberately omitted
in #39 pending this work; the README's `systemctl reload
packetframe` instruction now resolves.
* block-prefix reconcile path (v0.2.1 BLOCK_V4/V6 maps existed but
had no reconcile path — adding/removing block-prefix lines required
full restart). Mirrors allow-prefix line for line.
Two new counters at SPEC §4.6 indices 33-34 (mss_clamp_applied,
mss_clamp_skipped) — append-only per CLAUDE.md guardrail. FpCfg
layout bumped V1 -> V2 (the formerly-_reserved [u8; 2] slot is now
mss_clamp_global: u16; struct size unchanged).
Userspace + BPF map additions: MSS_CLAMP_V{4,6} (LPM tries with
MssClampValue { mss, iface_filter }) + MSS_CLAMP_BY_IFACE (HashMap
keyed on egress ifindex). Lookup precedence in BPF: src-prefix LPM
-> dst-prefix LPM -> per-iface HashMap -> CFG.mss_clamp_global.
TCP options walk is verifier-bounded (40-iteration cap; max one MSS
option in 40 bytes); RFC 1624 incremental checksum update mirrors
the existing decrement_ipv4_ttl pattern.
Docs: new runbooks docs/runbooks/{mss-clamp,reconfigure}.md, README
Status table expanded to enumerate the sketched future modules
(ddos / sampler / randomizer / dispatcher) with SPEC §-pointers and
priority buckets, mss-clamp + reconfigure rows added as v0.2.4+
production. conf/example.conf grew an mss-clamp section.
Tests: 14 new parser tests covering all four grammars, IPv4 + IPv6,
boundary checks (88 / 65495), duplicate-arg rejection, missing-args
errors. All 94 + 40 lib tests pass; cargo fmt + clippy clean.
Out of scope (deferred):
* Separate `mss-clamp` module with the multi-module dispatcher.
* Hot-reload of attach iface set / route-source / circuit-breaker /
local-prefix.
* `packetframe reconfigure --check` (dry-run validation).
* Type=notify migration of the systemd unit.
SPEC.md update summary (private spec — apply manually):
* §3.2 priority taxonomy: mss-clamp lives at priority 1500
(1000-1999 forwarding/transformation bucket).
* §4.x new "MSS clamping" section: lookup precedence, SYN+SYN-ACK
semantics, lower-if-higher rule, IPv4/IPv6 parity, skipped-counter
conditions.
* §4.6 counter table: append rows 33 (mss_clamp_applied) and 34
(mss_clamp_skipped).
* §8.4 reconfigure UX: clarify CLI subcommand mechanics (SIGHUP via
PIDFile + ack-timestamp poll), explicit list of what's hot-
reloadable as of v0.2.4 vs what isn't.
* §5.x dispatcher placeholder: record that any second module on the
same hook needs the dispatcher first; mss-clamp avoiding-this by
living inside fast-path validates the design.
* §11.4: close out — "Resolved in v0.2.4 via fast-path mss-clamp
directive (not a separate module — bpf_redirect_map bypasses TC
egress, dispatcher prerequisite for separate-program approach)."
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The nested BPF build runs cargo with `--message-format=json` so build.rs can extract the artifact path. JSON messages go to stdout, human-readable progress + summary lines go to stderr. The previous forward_output(&[], &out.stderr) call only forwarded the latter, which meant rustc's actual diagnostic content (errors + warnings) was completely silent — operators saw "BPF build failed (exit 101) — due to 1 previous error; 10 warnings emitted" with no clue what the error was. Add forward_rendered_diagnostics() that walks the JSON stream and extracts each compiler-message's `rendered` field (the same text rustc would print to stderr without --message-format=json). Wire it into the failure path so operators actually see what broke. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI surfaced "Looks like the BPF stack limit is exceeded" — the
mss_clamp_inline + lookup_mss_clamp_v{4,6} functions, all marked
#[inline(always)], were being inlined into forward_success which
itself is inlined into the already-large fast_path XDP entry. The
cumulative stack frame blew past BPF's 512-byte limit.
Switch all three to #[inline(never)] so LLVM emits them as
separate BPF subprograms, each with its own 512-byte budget.
Block-scope the LPM Key locals inside the lookup functions so LLVM
can reuse the same stack slot for src vs dst keys rather than
carrying both live.
The error was invisible until the previous build.rs commit started
forwarding rustc JSON diagnostics; without that, "due to 1 previous
error; 10 warnings emitted" was all operators saw.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…bprogram
The previous fix split mss_clamp_inline into a separate BPF
subprogram to escape the 512-byte stack budget. CI then surfaced
a kernel verifier rejection at attach time:
4738: (67) r1 <<= 32
R1 pointer arithmetic with <<= operator prohibited
BPF subprograms can't accept packet pointers as arguments — LLVM
emits a 32->64 bit lift via shift-left for the pkt_ptr arg, and
the verifier prohibits arithmetic shifts on packet pointers. Pass
a scalar `ip_offset` instead and reconstruct the IP-header pointer
inside the subprogram via `ctx.data() + ip_offset`, which the
verifier tracks as a fresh packet-derived pointer with the bounds
check immediately following. Same pattern the existing `ptr_at`
helper uses.
Also drop the `let _ = proto;` workaround — proto is no longer
destructured into the outer scope, the early-return guards inside
each branch are sufficient.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reverting to #[inline(always)] for mss_clamp_inline + lookup
helpers because BPF subprograms can't take packet-pointer args.
Two earlier attempts ran into the verifier:
- Pass `ip: *mut u8` directly: rejected
"R1 pointer arithmetic with <<= operator prohibited"
- Pass `ip_offset: usize` via &XdpContext: rejected
"R2 pointer arithmetic on pkt_end prohibited" — LLVM SROA
decomposed &XdpContext into (data, data_end) packet pointers
behind our backs.
Inlining is the only verifier-friendly option for code that
touches the packet. The stack overflow that prompted the subprogram
attempt is mitigated by:
- Reading src/dst addresses inside block scopes within the
lookup helpers, not as locals in the parent frame. The Key
locals + addr arrays are released when each block exits.
- Letting the lookup helpers take a pointer-to-IP-header rather
than addresses by value, so the caller doesn't pre-materialize
them.
- Reading the proto byte first and bailing on non-TCP, the
common case for fast-pathed traffic.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous run hit "BPF program is too large. Processed 1000001 insn" — the 0..40 loop with the match-based kind dispatch caused combinatorial state explosion in the BPF verifier (max_states_per_insn was 23, total_states 29302). Stack was fine (0+360+0+0). Reduce to 0..8 iterations: real SYN packets put MSS in the first 1-4 options (Linux's tcp_options_write emits MSS very early), and 8 is comfortable headroom while keeping verifier state-space bounded. Worst-case behavior change: SYNs that bury MSS deeper than option-position 8 won't get clamped — that's pathological in practice and we'd bump mss_clamp_skipped on those. Also flatten the inner `match` to sequential `if`s (kind == 0, kind == 1, length-prefixed) — fewer branches per iteration cuts the state-space split factor. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bundled into the v0.2.4-mss-clamp PR so the release tag flow is a single merge + tag rather than a separate release-prep PR. Bumped: workspace Cargo.toml (0.2.3 -> 0.2.4), VERSION, both README install snippets (v0.2.3 -> v0.2.4 for the .deb and tarball blocks). Cargo.lock cascaded via `cargo check`. BPF crates at crates/modules/*/bpf/ stay at 0.0.1 -- independent versioning, never tracked the workspace. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes the SPEC §11.4 MSS-clamping gap and lands the long-deferred reconfigure CLI /
systemctl reloadpath in one bundled PR.mss-clampdirective (fast-path). Inline mutation of the TCP MSS option in matched SYN/SYN-ACK packets before redirect. Fixes the iptables-bypass gap on real edge deployments whereallow-prefixoverlaps a-j TCPMSSFORWARD rule. Four grammars (global / per-iface / per-prefix / per-prefix+iface), SYN+SYN-ACK both directions, lower-if-higher policy, IPv4 + IPv6 parity, RFC 1624 incremental checksum. See docs/runbooks/mss-clamp.md.packetframe reconfiguresubcommand. SIGHUP plumbing was already wired (loader.rs:reconfigure_from_signal, reconcile.rs) — what was missing was the CLI side. Now uses PIDFile +/proc/<pid>/commvalidation, sends SIGHUP, and polls alast-reconfigure.timestampack marker for up to 5s. Exit code distinguishes parse errors from per-module reconcile failures. See docs/runbooks/reconfigure.md.ExecReload=+PIDFile=in the systemd unit (crates/cli/debian/packetframe.service). The README's aspirationalsystemctl reload packetframefinally works.block-prefixreconcile path (small adjacent fix). v0.2.1 shippedBLOCK_V4/V6maps but no reconcile, so adding/removingblock-prefixlines required restart. ~50-LOC mirror of the existing allow-prefix reconcile.build.rsimprovement. Forward BPF rustc JSON diagnostics on failure so future BPF compile errors are visible — without this,cargo::warning=[bpf stderr]only carried "Compiling X" progress lines and the final summary, hiding actual rustc diagnostics. Found while debugging this PR's BPF stack/verifier issues.Why bundle
What's in the PR
MssClampValue,MSS_CLAMP_V4/V6/BY_IFACE.FpCfglayout V1 → V2 (the formerly-_reservedslot is nowmss_clamp_global: u16; struct size unchanged).StatIdx33/34 =mss_clamp_applied/skipped(append-only per CLAUDE.md).mss_clamp_inline(and helpers) called fromforward_successbetween devmap check and TTL decrement. All#[inline(always)]because BPF subprograms can't take packet-pointer args (see "BPF lessons" below). Verifier-bounded TCP-options walk capped at 8 iterations with sequentialifbranches (state-space-friendly). RFC 1624 incremental TCP checksum update.ModuleDirective::MssClamp+MssClampPrefixenum. Parser disambiguates the four grammars by token shape (CIDR contains/,viais the iface keyword). Bounds check 88 ≤ mss ≤ 65495.populate_mss_clamp(load),reconcile_mss_clamp(SIGHUP delta),reconcile_block_v4/v6. The mss reconcile uses re-insert-on-key semantics so a value change for an existing prefix is picked up.Reconfigurehandler replaces the stub. Newloader::reconfigure()+ReconfigureError. PIDFile lifecycle inrun_linux. Marker file written byreconfigure_from_signal.ExecReload=/bin/kill -HUP $MAINPID,PIDFile=....compiler-messagerecords, emitrenderedfield as cargo warnings).BPF lessons learned (worth recording for future XDP work)
The BPF side took five iterations after the initial design:
#[inline(never)]to escape stack budget — failed. BPF kernel verifier rejects bpf2bpf calls with packet-pointer arguments:R1 pointer arithmetic with <<= operator prohibited. LLVM emits a 32→64 bit lift via shift-left for the pkt_ptr arg to satisfy the BPF calling convention; the verifier prohibits arithmetic shifts on packet pointers.ip_offset: usize(scalar) instead — failed. LLVM's SROA decomposes&XdpContextinto(data, data_end)packet pointers across the call boundary. Same verifier rejection on the data_end arg.#[inline(always)], trim stack via block scopes — partially worked. Keeps everything in the parent's frame; reading src/dst inside LPM-lookup blocks lets the compiler reuse stack slots between src and dst keys.BPF program is too large). Combinatorial state explosion across the innermatch-based kind dispatch.matchto sequentialif— final fix. Real SYN packets put MSS in option positions 1-4 (Linux'stcp_options_writeemits MSS very early); 8 is comfortable headroom. SYNs that bury MSS deeper get counted asmss_clamp_skipped(pathological in practice).Takeaway captured in crates/modules/fast-path/bpf/src/main.rs doc-comments and SPEC §4.12: for BPF code that touches packet pointers,
#[inline(always)]is essentially mandatory; stack-budget pressure has to be solved by tightening locals, not by extracting subprograms.Tests
14 new parser tests in config.rs covering all four grammars (IPv4 + IPv6), boundary values (88, 65495), out-of-range rejection, missing-arg errors, and multi-line accumulation.
cargo fmt --checkclean.cargo clippy --workspace --all-targets --all-features -- -D warningsclean.CI
The qemu jobs run the real BPF program through the kernel verifier on attach, so this confirms the byte-level mutation logic is verifier-clean on both 5.15 and 6.6.
Test plan (post-merge, on the deployed router)
apt install ./packetframe_0.2.4_amd64.deb.mss-clamp 23.191.201.0/24 via eth2 1360(replacing the equivalent iptables-j TCPMSSrule).sudo packetframe reconfigure.packetframe statusshowsmss_clamp_appliedclimbing on customer SYN traffic.tcpdump -i eth2 -n 'tcp[tcpflags] & tcp-syn != 0' -c 5 -vvconfirms the wire MSS is 1360.-j TCPMSSrule once mss-clamp is confirmed firing.packetframe reconfigureexits non-zero with parse-error detail; daemon stays up with old config.Out of scope (explicitly)
mss-clampmodule (requires the dispatcher first; SPEC §5.0).packetframe reconfigure --check(dry-run validation).Type=notifymigration of the systemd unit.Tag flow after merge
🤖 Generated with Claude Code