Skip to content

Commit df2aee2

Browse files
lukaszraczyloshanduur
authored andcommitted
fix: macb silent TX stall on BCM2712/RP1 (RFC patches from netdev)
Drop the [RFC PATCH net-next 0/3] series 'net: macb: candidate fixes for silent TX stall on BCM2712/RP1' into kernel/build/patches/. Patches mirror the netdev submission byte-for-byte: 0001-net-macb-flush-PCIe-posted-write-after-TSTART-doorbe.patch Read-back of NCR after TSTART so the doorbell reaches the MAC before the function returns. 0002-net-macb-re-check-ISR-after-IER-re-enable-in-macb_tx.patch Re-reads ISR after IER re-enable in macb_tx_poll() to catch TCOMP raised inside the IDR/IER mask window. 0003-net-macb-add-TX-stall-watchdog-as-defence-in-depth-s.patch Per-queue delayed_work safety net that calls macb_tx_restart() if tx_tail hasn't advanced for >= 1 s while the ring is non-empty. All three apply cleanly to the v6.18.24 tarball already pulled by Pkgfile. Lore thread: https://lore.kernel.org/netdev/cover.1777064117.git.lukasz@raczylo.com/T/ Issue invitation: siderolabs/sbc-raspberrypi#91 Signed-off-by: Lukasz Raczylo <lukasz@raczylo.com> Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com> (cherry picked from commit ca3599f)
1 parent c0ef955 commit df2aee2

3 files changed

Lines changed: 345 additions & 0 deletions
Lines changed: 81 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,81 @@
1+
From 3106d546d494f2f52ec832e7f7d04f534286e254 Mon Sep 17 00:00:00 2001
2+
Message-ID: <3106d546d494f2f52ec832e7f7d04f534286e254.1777064117.git.lukasz@raczylo.com>
3+
In-Reply-To: <cover.1777064117.git.lukasz@raczylo.com>
4+
References: <cover.1777064117.git.lukasz@raczylo.com>
5+
From: Lukasz Raczylo <lukasz@raczylo.com>
6+
Date: Fri, 24 Apr 2026 21:50:55 +0100
7+
Subject: [RFC PATCH net-next 1/3] net: macb: flush PCIe posted write after
8+
TSTART doorbell
9+
10+
macb_start_xmit() and macb_tx_restart() kick transmission by
11+
OR-ing MACB_BIT(TSTART) into NCR. On PCIe-attached macb instances
12+
(BCM2712 + RP1 PCIe south bridge on Raspberry Pi 5 is the setup we
13+
have in front of us), writes to NCR are posted PCIe writes: they
14+
are not guaranteed to reach the device before the issuing CPU
15+
returns. If the TSTART doorbell does not reach the MAC, no TX
16+
begins, no TCOMP completion arrives, and the ring remains
17+
quiescent without any kernel-visible indication.
18+
19+
Note that the raspberrypi/linux vendor fork carries a local patch
20+
around the TSTART site (a queue->tx_pending breadcrumb that is
21+
promoted to queue->txubr_pending by the next TCOMP interrupt,
22+
triggering macb_tx_restart()). That workaround makes the loss
23+
recoverable under traffic, but it cannot help if TCOMP itself is
24+
not raised because no TX started -- which is exactly the case we
25+
are targeting here. The handshake is not present in mainline.
26+
27+
Add a read-back of NCR after each TSTART write in macb_start_xmit()
28+
and macb_tx_restart(). The read is an architected PCIe read
29+
barrier for earlier posted writes on the same path; it ensures the
30+
doorbell has reached the MAC before the functions return.
31+
32+
We do not yet have direct hardware evidence that TSTART is being
33+
lost on the RP1 path (that would require a PCIe protocol analyser,
34+
or at minimum a before/after counter on queue->tx_stall_last_tail
35+
with and without this patch applied in isolation). This patch is
36+
one of a three-patch series ("candidate fixes for silent TX stall
37+
on BCM2712/RP1"); see the cover letter for context. We have
38+
verified the series compiles and applies cleanly against mainline
39+
HEAD and against raspberrypi/linux rpi-6.18.y @ f2f68e79f16f;
40+
runtime verification is pending.
41+
42+
Link: https://github.com/cilium/cilium/issues/43198
43+
Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877
44+
Signed-off-by: Lukasz Raczylo <lukasz@raczylo.com>
45+
---
46+
drivers/net/ethernet/cadence/macb_main.c | 12 ++++++++++++
47+
1 file changed, 12 insertions(+)
48+
49+
diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
50+
index a12aa2124..b6cca55ad 100644
51+
--- a/drivers/net/ethernet/cadence/macb_main.c
52+
+++ b/drivers/net/ethernet/cadence/macb_main.c
53+
@@ -1922,6 +1922,13 @@ static void macb_tx_restart(struct macb_queue *queue)
54+
55+
spin_lock(&bp->lock);
56+
macb_writel(bp, NCR, macb_readl(bp, NCR) | MACB_BIT(TSTART));
57+
+ /*
58+
+ * Flush the PCIe posted-write queue so the TSTART doorbell
59+
+ * reliably reaches the MAC. Without this, the write can sit
60+
+ * in the fabric and the MAC never advances, causing a silent
61+
+ * TX stall.
62+
+ */
63+
+ (void)macb_readl(bp, NCR);
64+
spin_unlock(&bp->lock);
65+
66+
out_tx_ptr_unlock:
67+
@@ -2560,6 +2567,11 @@ static netdev_tx_t macb_start_xmit(struct sk_buff *skb, struct net_device *dev)
68+
spin_lock(&bp->lock);
69+
macb_tx_lpi_wake(bp);
70+
macb_writel(bp, NCR, macb_readl(bp, NCR) | MACB_BIT(TSTART));
71+
+ /*
72+
+ * Flush the PCIe posted-write queue; see the comment in
73+
+ * macb_tx_restart() for the reasoning.
74+
+ */
75+
+ (void)macb_readl(bp, NCR);
76+
spin_unlock(&bp->lock);
77+
78+
if (CIRC_SPACE(queue->tx_head, queue->tx_tail, bp->tx_ring_size) < 1)
79+
--
80+
2.53.0
81+
Lines changed: 106 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,106 @@
1+
From 18740f04225a1778e6938ab5ecfe82087d66eb27 Mon Sep 17 00:00:00 2001
2+
Message-ID: <18740f04225a1778e6938ab5ecfe82087d66eb27.1777064117.git.lukasz@raczylo.com>
3+
In-Reply-To: <cover.1777064117.git.lukasz@raczylo.com>
4+
References: <cover.1777064117.git.lukasz@raczylo.com>
5+
From: Lukasz Raczylo <lukasz@raczylo.com>
6+
Date: Fri, 24 Apr 2026 21:52:05 +0100
7+
Subject: [RFC PATCH net-next 2/3] net: macb: re-check ISR after IER re-enable
8+
in macb_tx_poll
9+
10+
macb_tx_poll() runs with TCOMP masked, drains the TX ring, then
11+
calls napi_complete_done() and re-enables TCOMP via IER. An
12+
existing comment in the function notes:
13+
14+
/* Packet completions only seem to propagate to raise
15+
* interrupts when interrupts are enabled at the time, so if
16+
* packets were sent while interrupts were disabled,
17+
* they will not cause another interrupt to be generated when
18+
* interrupts are re-enabled.
19+
*/
20+
21+
and mitigates this by calling macb_tx_complete_pending(), which
22+
inspects driver-visible ring state (descriptor->ctrl, after rmb())
23+
and reschedules NAPI if a completion is observable in memory.
24+
25+
On PCIe-attached parts (BCM2712 + RP1 on Raspberry Pi 5 is the
26+
setup we have in front of us), the descriptor DMA write that sets
27+
TX_USED may not have retired to system memory at the point
28+
macb_tx_complete_pending() runs. The rmb() synchronises the CPU
29+
view of earlier CPU writes; it is not sufficient to retire an
30+
in-flight peripheral DMA write. Under that ordering the in-memory
31+
descriptor can still read TX_USED=0 when the hardware has in fact
32+
completed the frame; the check returns false; NAPI exits; the
33+
quirk above prevents the re-enabled IER from re-firing; the ring
34+
goes quiescent.
35+
36+
Add an explicit ISR read after the IER write. The MMIO read
37+
serves two independent purposes:
38+
39+
(1) It is an architected PCIe read barrier for earlier
40+
peripheral-originated DMA writes on the same path, so a
41+
subsequent macb_tx_complete_pending() observes any TX_USED
42+
write that was in flight at the time of the barrier.
43+
44+
(2) It samples the hardware ISR directly, so a TCOMP bit that
45+
the hardware set while TCOMP was masked is visible here,
46+
independently of whether the descriptor DMA has retired.
47+
48+
If either signal indicates pending work, reschedule NAPI via the
49+
same path as the existing check.
50+
51+
This patch addresses one of three candidate races for the silent
52+
TX stall described in the cover letter. Whether it is sufficient
53+
by itself, or whether it requires the PCIe posted-write flush in
54+
patch 1/3 to cover the observed behaviour, we have not yet
55+
verified at runtime.
56+
57+
Link: https://github.com/cilium/cilium/issues/43198
58+
Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877
59+
Signed-off-by: Lukasz Raczylo <lukasz@raczylo.com>
60+
---
61+
drivers/net/ethernet/cadence/macb_main.c | 28 +++++++++++++++---------
62+
1 file changed, 18 insertions(+), 10 deletions(-)
63+
64+
diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
65+
index b6cca55ad..ea231b1c5 100644
66+
--- a/drivers/net/ethernet/cadence/macb_main.c
67+
+++ b/drivers/net/ethernet/cadence/macb_main.c
68+
@@ -1973,17 +1973,25 @@ static int macb_tx_poll(struct napi_struct *napi, int budget)
69+
if (work_done < budget && napi_complete_done(napi, work_done)) {
70+
queue_writel(queue, IER, MACB_BIT(TCOMP));
71+
72+
- /* Packet completions only seem to propagate to raise
73+
- * interrupts when interrupts are enabled at the time, so if
74+
- * packets were sent while interrupts were disabled,
75+
- * they will not cause another interrupt to be generated when
76+
- * interrupts are re-enabled.
77+
- * Check for this case here to avoid losing a wakeup. This can
78+
- * potentially race with the interrupt handler doing the same
79+
- * actions if an interrupt is raised just after enabling them,
80+
- * but this should be harmless.
81+
+ /*
82+
+ * TCOMP events that fire while the interrupt is masked do
83+
+ * not re-fire when IER is re-enabled. Catch this two ways
84+
+ * to avoid losing a wakeup:
85+
+ *
86+
+ * (1) Read ISR -- catches completions the hardware flagged
87+
+ * but that we did not see as an interrupt. The MMIO
88+
+ * read doubles as a PCIe read barrier, flushing any
89+
+ * in-flight descriptor TX_USED DMA writes into memory.
90+
+ * (2) macb_tx_complete_pending() inspects the ring after
91+
+ * that flush, catching a descriptor whose TX_USED is
92+
+ * now visible as a result of the barrier.
93+
+ *
94+
+ * This can race with the interrupt handler taking the same
95+
+ * path if an interrupt fires just after the IER write;
96+
+ * rescheduling NAPI in that case is harmless.
97+
*/
98+
- if (macb_tx_complete_pending(queue)) {
99+
+ if ((queue_readl(queue, ISR) & MACB_BIT(TCOMP)) ||
100+
+ macb_tx_complete_pending(queue)) {
101+
queue_writel(queue, IDR, MACB_BIT(TCOMP));
102+
macb_queue_isr_clear(bp, queue, MACB_BIT(TCOMP));
103+
netdev_vdbg(bp->dev, "TX poll: packets pending, reschedule\n");
104+
--
105+
2.53.0
106+
Lines changed: 158 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,158 @@
1+
From c0469642f42ada85d91a8a685eb7c0e04cb99131 Mon Sep 17 00:00:00 2001
2+
Message-ID: <c0469642f42ada85d91a8a685eb7c0e04cb99131.1777064117.git.lukasz@raczylo.com>
3+
In-Reply-To: <cover.1777064117.git.lukasz@raczylo.com>
4+
References: <cover.1777064117.git.lukasz@raczylo.com>
5+
From: Lukasz Raczylo <lukasz@raczylo.com>
6+
Date: Fri, 24 Apr 2026 21:52:06 +0100
7+
Subject: [RFC PATCH net-next 3/3] net: macb: add TX stall watchdog as
8+
defence-in-depth safety net
9+
10+
Patches 1/3 and 2/3 address two candidate races that could lead
11+
to a TCOMP completion being missed on PCIe-attached macb
12+
instances. This patch adds a defence-in-depth safety net, in
13+
case a further race remains that we have not identified.
14+
15+
The watchdog is a per-queue delayed_work that runs once per
16+
second. It snapshots queue->tx_tail; if the ring is non-empty
17+
(queue->tx_head != queue->tx_tail) and tx_tail has not advanced
18+
since the previous tick, it calls macb_tx_restart().
19+
20+
No new recovery logic is introduced. macb_tx_restart() already
21+
exists in this file, is correctly locked (tx_ptr_lock, bp->lock),
22+
and verifies that the hardware's TBQP is behind the driver's
23+
head index before re-asserting TSTART. On a healthy ring it is
24+
a no-op at the hardware level; the watchdog only supplies the
25+
missing trigger.
26+
27+
On a healthy queue the per-tick cost is one spin_lock_irqsave()
28+
/ spin_unlock_irqrestore() and one branch. The delayed_work is
29+
only scheduled between macb_open() and macb_close(), and is
30+
cancelled synchronously on close.
31+
32+
Context for submission: on our 24-node Raspberry Pi 5 fleet,
33+
before this series, an out-of-band user-space watchdog
34+
(monitoring tx_packets from /sys/class/net/.../statistics and
35+
toggling the link down/up when it froze) was required to keep
36+
nodes usable. We include this kernel-side watchdog as a cleaner
37+
in-kernel equivalent for any residual stall that patches 1 and
38+
2 do not cover. We are willing to drop this patch if the view
39+
is that 1 and 2 should stand alone.
40+
41+
Link: https://github.com/cilium/cilium/issues/43198
42+
Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877
43+
Signed-off-by: Lukasz Raczylo <lukasz@raczylo.com>
44+
---
45+
drivers/net/ethernet/cadence/macb.h | 5 ++
46+
drivers/net/ethernet/cadence/macb_main.c | 59 ++++++++++++++++++++++++
47+
2 files changed, 64 insertions(+)
48+
49+
diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h
50+
index 2de56017e..9115f2b47 100644
51+
--- a/drivers/net/ethernet/cadence/macb.h
52+
+++ b/drivers/net/ethernet/cadence/macb.h
53+
@@ -1278,6 +1278,11 @@ struct macb_queue {
54+
dma_addr_t tx_ring_dma;
55+
struct work_struct tx_error_task;
56+
bool txubr_pending;
57+
+
58+
+ /* TX stall watchdog -- see macb_tx_stall_watchdog() in macb_main.c */
59+
+ struct delayed_work tx_stall_watchdog_work;
60+
+ unsigned int tx_stall_last_tail;
61+
+
62+
struct napi_struct napi_tx;
63+
64+
dma_addr_t rx_ring_dma;
65+
diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c
66+
index ea231b1c5..ea2306ef7 100644
67+
--- a/drivers/net/ethernet/cadence/macb_main.c
68+
+++ b/drivers/net/ethernet/cadence/macb_main.c
69+
@@ -2002,6 +2002,59 @@ static int macb_tx_poll(struct napi_struct *napi, int budget)
70+
return work_done;
71+
}
72+
73+
+#define MACB_TX_STALL_INTERVAL_MS 1000
74+
+
75+
+/*
76+
+ * TX stall watchdog.
77+
+ *
78+
+ * Defence-in-depth against lost TCOMP interrupts. macb already has a
79+
+ * recovery chain (tx_pending -> txubr_pending -> macb_tx_restart())
80+
+ * that fires on TCOMP; if TCOMP itself is lost the TX ring stalls
81+
+ * silently until something else kicks TSTART. This watchdog runs
82+
+ * once per second per queue, snapshots tx_tail, and calls
83+
+ * macb_tx_restart() if the ring is non-empty and tx_tail has not
84+
+ * advanced since the previous tick.
85+
+ *
86+
+ * macb_tx_restart() already checks the hardware's TBQP against the
87+
+ * driver's head index before re-asserting TSTART, so on a healthy
88+
+ * ring this is a no-op at the hardware level. The watchdog only
89+
+ * adds the missing trigger.
90+
+ */
91+
+static void macb_tx_stall_watchdog(struct work_struct *work)
92+
+{
93+
+ struct macb_queue *queue = container_of(to_delayed_work(work),
94+
+ struct macb_queue,
95+
+ tx_stall_watchdog_work);
96+
+ struct macb *bp = queue->bp;
97+
+ unsigned int cur_tail, cur_head;
98+
+ bool stalled = false;
99+
+ unsigned long flags;
100+
+
101+
+ if (!netif_running(bp->dev))
102+
+ return;
103+
+
104+
+ spin_lock_irqsave(&queue->tx_ptr_lock, flags);
105+
+ cur_tail = queue->tx_tail;
106+
+ cur_head = queue->tx_head;
107+
+ if (cur_head != cur_tail &&
108+
+ cur_tail == queue->tx_stall_last_tail)
109+
+ stalled = true;
110+
+ else
111+
+ queue->tx_stall_last_tail = cur_tail;
112+
+ spin_unlock_irqrestore(&queue->tx_ptr_lock, flags);
113+
+
114+
+ if (stalled) {
115+
+ netdev_warn_once(bp->dev,
116+
+ "TX stall detected on queue %u (tail=%u head=%u); re-kicking TSTART\n",
117+
+ (unsigned int)(queue - bp->queues),
118+
+ cur_tail, cur_head);
119+
+ macb_tx_restart(queue);
120+
+ }
121+
+
122+
+ schedule_delayed_work(&queue->tx_stall_watchdog_work,
123+
+ msecs_to_jiffies(MACB_TX_STALL_INTERVAL_MS));
124+
+}
125+
+
126+
static void macb_hresp_error_task(struct work_struct *work)
127+
{
128+
struct macb *bp = from_work(bp, work, hresp_err_bh_work);
129+
@@ -3190,6 +3243,9 @@ static int macb_open(struct net_device *dev)
130+
for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue) {
131+
napi_enable(&queue->napi_rx);
132+
napi_enable(&queue->napi_tx);
133+
+ queue->tx_stall_last_tail = queue->tx_tail;
134+
+ schedule_delayed_work(&queue->tx_stall_watchdog_work,
135+
+ msecs_to_jiffies(MACB_TX_STALL_INTERVAL_MS));
136+
}
137+
138+
macb_init_hw(bp);
139+
@@ -3240,6 +3296,7 @@ static int macb_close(struct net_device *dev)
140+
for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue) {
141+
napi_disable(&queue->napi_rx);
142+
napi_disable(&queue->napi_tx);
143+
+ cancel_delayed_work_sync(&queue->tx_stall_watchdog_work);
144+
netdev_tx_reset_queue(netdev_get_tx_queue(dev, q));
145+
}
146+
147+
@@ -4802,6 +4859,8 @@ static int macb_init_dflt(struct platform_device *pdev)
148+
}
149+
150+
INIT_WORK(&queue->tx_error_task, macb_tx_error_task);
151+
+ INIT_DELAYED_WORK(&queue->tx_stall_watchdog_work,
152+
+ macb_tx_stall_watchdog);
153+
q++;
154+
}
155+
156+
--
157+
2.53.0
158+

0 commit comments

Comments
 (0)