
test(e2e): add cross mesh granularity #490

Open

kvaps wants to merge 7 commits into squat:main from cozystack:feat/cross-mesh-granularity

Conversation

@kvaps
Contributor

@kvaps kvaps commented Apr 28, 2026

Summary

This rebases #328 (--mesh-granularity=cross by @skirsten) onto current
main and adds the e2e test suite that was the only blocker for merge,
per the discussion in #489.

The three original commits from #328 are preserved as-is to keep
authorship attribution. A small conflict in docs/kg.md was resolved
in favour of the current --mtu description (which moved to the
auto default in #406) while keeping #328's addition of cross to
the granularity list.

What cross does

Direct WireGuard tunnels between every pair of nodes that live in
different locations; intra-location traffic stays on the CNI overlay.
This sits between location (one tunnel per location pair, leader as
relay → SPOF) and full (one tunnel per node pair, including
intra-location overhead).
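
The trade-off above can be made concrete with a back-of-envelope tunnel count (a hypothetical two-location cluster; the node counts are illustrative, not taken from the PR):

```shell
# Back-of-envelope tunnel counts for a hypothetical two-location mesh.
# n_a/n_b are illustrative node counts, not from the PR.
n_a=2; n_b=3
n=$(( n_a + n_b ))
full=$(( n * (n - 1) / 2 ))   # full: one tunnel per node pair
cross=$(( n_a * n_b ))        # cross: only cross-location node pairs
location=1                    # location: one tunnel per location pair (via leaders)
echo "full=$full cross=$cross location=$location"
```

With 5 nodes split 2/3, cross needs 6 tunnels versus 10 for full, while avoiding location's single leader-to-leader relay.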

New e2e suite (e2e/cross-mesh.sh)

Mirrors e2e/full-mesh.sh and e2e/location-mesh.sh:

  • setup_suite — annotates the kind nodes into two locations
    (control-plane and first worker as loc-a, second worker as loc-b)
    so the test exercises the cross-location case
  • test_cross_mesh_connectivity — pings + adjacency matrix
  • test_cross_mesh_peer — kgctl peer create / showconf with
    granularity cross
  • test_mesh_granularity_auto_detect — kgctl graph auto-detection
  • test_cross_peer_topology — sanity that loc-a nodes only see the
    loc-b node as a peer (and vice versa), distinguishing cross from
    full (every node is a peer) and location (non-leaders have no
    peers at all)
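
As a sketch, the two-location split from setup_suite might be expressed with Kilo's kilo.squat.ai/location annotation (node names follow kind's defaults; the echo is a dry run standing in for the suite's actual kubectl calls):

```shell
# Hypothetical sketch of the two-location split using Kilo's
# kilo.squat.ai/location node annotation. Node names follow kind's
# defaults; the echo is a dry run, not the suite's real mechanism.
annotate() {
  echo "kubectl annotate node $1 kilo.squat.ai/location=$2 --overwrite"
}
annotate kind-control-plane loc-a
annotate kind-worker        loc-a
annotate kind-worker2       loc-b
```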

Wired into the existing e2e make target, between location-mesh.sh and
multi-cluster.sh.

Refs

squat pushed a commit that referenced this pull request Apr 28, 2026
Adds my GitHub username to the breakpoint authorized-users so I can
SSH into the runner when the e2e job fails on PRs I'm involved in
(currently #490 / #491). Per maintainer's suggestion in #489.

Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
skirsten and others added 6 commits April 28, 2026 14:35
Mirrors e2e/full-mesh.sh and e2e/location-mesh.sh for the new
--mesh-granularity=cross mode introduced by the preceding commits.

setup_suite annotates the kind nodes into two locations (control-plane
and the first worker as loc-a, the second worker as loc-b) so the test
exercises the case "cross" is meant to handle: direct WireGuard
tunnels between locations, native CNI inside a location.

Tests:
- test_cross_mesh_connectivity: pings + adjacency matrix
- test_cross_mesh_peer: kgctl peer create/showconf
- test_mesh_granularity_auto_detect: kgctl graph auto-detection
- test_cross_peer_topology: sanity that loc-a nodes see only the
  loc-b node as a peer (and vice versa), distinguishing "cross"
  from "full" (where every node is a peer) and "location" (where
  non-leaders have no peers at all)

The new suite is wired into the existing e2e make target between
location-mesh.sh and multi-cluster.sh.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
The "cross from {a,b,c,d}" test cases added in 3590b12 predate the
cniCompatibilityIPs field on segment, introduced by Cilium support
in squat#409. Each segment in the cross test cases describes a single
node, so the expected value mirrors the existing full/location
cases: []*net.IPNet{nil}.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
The cross granularity intentionally removes the WireGuard tunnel
between nodes that share a location and relies on the underlying CNI
to carry intra-location pod traffic over its own overlay (e.g. Cilium
VXLAN). The bridge CNI used by the e2e harness has no such overlay,
so check_ping/check_adjacent cannot succeed on this cluster — they
were timing out trying to reach the same-location worker.

Keep the topology checks (peer count per node, kgctl graph
auto-detect, kgctl peer create), which validate the cross routing
logic without depending on a CNI overlay. End-to-end connectivity
under cross is covered by the Cilium-CNI suite added separately.

Also clean up the location annotations in teardown_suite so the
suites that follow (multi-cluster, handlers, kgctl) start from the
same node-annotation state they used to.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@kvaps kvaps force-pushed the feat/cross-mesh-granularity branch from d1ef6fd to 840d14f on April 28, 2026 12:35
Just removing the location annotations leaves the DaemonSet in
--mesh-granularity=cross. The handler tests that follow assume the
control-plane WireGuard IP is 10.4.0.1 (the leader of a single-
location mesh) and time out when cross's per-node leader assignment
hands that IP to a different node.

Roll the DaemonSet back to --mesh-granularity=location in the
teardown so the cluster state mirrors what location-mesh.sh leaves
behind, which is the working baseline expected by multi-cluster.sh,
handlers.sh and kgctl.sh.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
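A minimal sketch of the rollback described above, assuming the granularity is switched by rewriting the flag in the DaemonSet args (the sed dry-run and file path below are illustrative, not the suite's actual mechanism):

```shell
# Minimal sketch of rolling the granularity flag back, assuming the
# teardown rewrites the DaemonSet args. The temp file stands in for the
# manifest; everything here is illustrative.
set_granularity() {
  sed "s/--mesh-granularity=[a-z]*/--mesh-granularity=$1/" "$2"
}
printf -- '--mesh-granularity=cross\n' > /tmp/kilo-args.txt
set_granularity location /tmp/kilo-args.txt  # prints --mesh-granularity=location
```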
@skirsten

Thanks for picking this back up.

FYI: While I was running this in my clusters for years and it worked great in general, there was one issue which I observed from time to time but could not track down. It might not even be related to this change:

When adding nodes to the cluster, the networking would sometimes misbehave and I was seeing a 10 Hz (IIRC) reconcile metric in Grafana.
Rolling out a restart of the DaemonSet resolved it every time, so I am pretty sure there is a race condition or some kind of distributed feedback loop happening.

I am not using it anymore at the moment so I don't have any concrete metrics or traces I can share unfortunately.

@kvaps
Contributor Author

kvaps commented Apr 28, 2026

@skirsten thanks for chiming in!

For the rebase + attribution: to keep this clean and merge under
your name, I pushed a ready-to-use branch with your three commits
cherry-picked onto current main, plus one fixture-only commit
that aligns the cross test cases with the new cniCompatibilityIPs
field on segment introduced by #409 (Cilium support). The branch
is at:

https://github.com/cozystack/kilo/tree/add-cross-mesh-granularity-rebased

You can fast-forward your fork's PR branch onto it with:

git remote add cozystack https://github.com/cozystack/kilo.git
git fetch cozystack add-cross-mesh-granularity-rebased
git checkout add-cross-mesh-granularity        # the branch behind #328
git reset --hard cozystack/add-cross-mesh-granularity-rebased
git push --force-with-lease origin add-cross-mesh-granularity

That refreshes #328 with three of your commits unchanged plus the
single follow-up. If you'd rather drop the fourth commit, that's
fine too — I can carry it in a separate PR after #328 merges. Once
your PR is in I'll close #490 and open a small follow-up with just
the e2e suite.

About the 10 Hz reconcile loop you saw: I had a quick look at the
current applyTopology path and couldn't pin down a clear cause,
but the reconcile path has moved on quite a bit since 2022 —
nodesAreEqual compares more fields now (CNICompatibilityIP,
Granularity, AllowedLocationIPs), and PR #409 adds annotation
write-back from each node. If the symptom resurfaces in current
main we can chase it as a separate issue with a kilo_reconciles_total
panel attached.

@kvaps kvaps marked this pull request as ready for review April 28, 2026 15:17
@kvaps kvaps changed the title Add cross mesh granularity test(e2e): add cross mesh granularity Apr 28, 2026