### 1 Limited‑breadth Monte‑Carlo planner

For each goal depth $g\in\{1,2,3\}$, decision index $d\in\{1,2,3\}$, and state $s$ we pre‑tabulate two base discovery probabilities $\bigl(p^{\text{base}}_{g,d,s},\,p^{\text{base2}}_{g,d,s}\bigr)$: $p^{\text{base}}$ applies where **one** goal‑leading action exists and $p^{\text{base2}}$ where **two** exist (non‑zero entries are listed in Table S1).

The planner’s effective breadth, i.e. the expected number of roll‑outs per node, is

$$
b = 9\,\sigma(\tilde b),\qquad 
b_2 = 5\,\sigma(\tilde b_2) \tag{1a}
$$

with the logistic transform $\sigma(x)=(1+e^{-x})^{-1}$, so that $0<b<9$ and $0<b_2<5$ for any real $\tilde b$, $\tilde b_2$.
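The mapping in (1a) can be sketched in a few lines of Python. This is a minimal illustration only: the function names `sigmoid` and `effective_breadth` are hypothetical and are not taken from the referenced implementation.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic transform sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def effective_breadth(b_tilde: float, b2_tilde: float) -> tuple[float, float]:
    """Map unconstrained parameters to (b, b_2) as in (1a)."""
    return 9.0 * sigmoid(b_tilde), 5.0 * sigmoid(b2_tilde)


# At b_tilde = b2_tilde = 0 the logistic equals 0.5, i.e. half the maximum breadth.
b, b2 = effective_breadth(0.0, 0.0)
print(b, b2)  # 4.5 2.5
```

Fitting $\tilde b$ and $\tilde b_2$ on the real line and squashing them this way keeps both breadths positive and bounded without constrained optimisation.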

Experience counters accumulate deterministically over trials $t$, growing only in cells whose base probability is non‑zero:

$$
\begin{aligned}
E^{(1)}_{t,d,s} &= E^{(1)}_{t-1,d,s} + b\,\mathbf 1\!\bigl[p^{\text{base}}_{g,d,s}>0\bigr],\\
E^{(2)}_{t,d,s} &= E^{(2)}_{t-1,d,s} + b_2\,\mathbf 1\!\bigl[p^{\text{base2}}_{g,d,s}>0\bigr].
\end{aligned}
$$

The resulting probabilities of discovering two, one or zero goal‑leading actions are

$$
\begin{aligned}
p^{(2)}_{t,d,s} &= \min\!\bigl(p^{\text{base2}}_{g,d,s}\,E^{(2)}_{t,d,s},1\bigr),\\
p^{(1)}_{t,d,s} &= \min\!\bigl(p^{\text{base}}_{g,d,s}\,E^{(1)}_{t,d,s},1\bigr),\\
p^{(0)}_{t,d,s} &= 1 - p^{(1)}_{t,d,s}.
\end{aligned} \tag{1}
$$

*Interpretation.*  Larger $\tilde b$ (or $\tilde b_2$) raises the breadth in (1a), which steepens the linear growth of the experience counters and hence of the discovery probabilities in (1); once $p^{(1)}$ or $p^{(2)}$ saturates at 1, extra sampling no longer helps.
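The linear growth and saturation described above can be traced for a single $(g,d,s)$ cell with the following sketch. It rests on assumptions that this section does not state: the counters start at zero, and the value `p_base = 0.05` is purely illustrative (the actual entries are in Table S1). The function name is hypothetical.

```python
def discovery_probabilities(p_base, p_base2, b, b2, n_trials):
    """Return per-trial tuples (p0, p1, p2) for one (g, d, s) cell, following (1)."""
    e1 = e2 = 0.0  # assumption: experience counters start at zero
    trajectory = []
    for _ in range(n_trials):
        # Counters grow only where the corresponding base probability is non-zero.
        e1 += b if p_base > 0 else 0.0
        e2 += b2 if p_base2 > 0 else 0.0
        p2 = min(p_base2 * e2, 1.0)
        p1 = min(p_base * e1, 1.0)
        p0 = 1.0 - p1
        trajectory.append((p0, p1, p2))
    return trajectory


# Illustrative values only: b = 4.5 and b2 = 2.5 correspond to b_tilde = b2_tilde = 0.
traj = discovery_probabilities(p_base=0.05, p_base2=0.0, b=4.5, b2=2.5, n_trials=6)
print([round(p1, 3) for _, p1, _ in traj])  # [0.225, 0.45, 0.675, 0.9, 1.0, 1.0]
```

In this example $p^{(1)}$ rises by $p^{\text{base}}\,b$ per trial and is capped at 1 from the fifth trial on, which is the saturation referred to in the interpretation.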

---

### 2 Instrumental state‑action values

Let $R_g=4$ denote the reward for reaching the goal and $\gamma=\sigma(\tilde\gamma)\in(0,1)$ an individual discount factor.  At decision $d$ (for goal depth $g$), taking and relinquishing control are valued as

$$
V_{\text{take}} = R_g\,\gamma^{\Delta},
\qquad 
V_{\text{relinquish}} = 1,
\quad
\Delta = g-d \tag{2}
$$

If two goal paths are discovered for $g\in\{1,2\}$, relinquishing also delivers the discounted reward, so that $V_{\text{relinquish}} = 1 + R_g\,\gamma^{\Delta}$ replaces the value in (2).  The constant 1 captures an exogenous incentive to cede control when both options are otherwise equal.
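As a rough sketch, the values in (2) and the two‑path exception can be computed as follows. The function name `action_values`, the argument `n_found`, and the return order (relinquish, take) are assumptions made for illustration. The section does not say how the values change when no goal path is discovered, so the sketch applies (2) as written in that case.

```python
import math

R_G = 4.0  # goal reward R_g from the text


def sigmoid(x: float) -> float:
    """Logistic transform used to keep gamma in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def action_values(g: int, d: int, gamma_tilde: float, n_found: int):
    """Return (V_relinquish, V_take) for goal depth g and decision index d."""
    gamma = sigmoid(gamma_tilde)
    delta = g - d  # Delta in (2); applied as written, with no guard for d > g
    v_take = R_G * gamma ** delta
    v_relinquish = 1.0
    # Exception from the text: two discovered paths and g in {1, 2}.
    if n_found == 2 and g in (1, 2):
        v_relinquish += R_G * gamma ** delta
    return v_relinquish, v_take


# gamma_tilde = 0 gives gamma = 0.5, so one step of delay halves the reward.
print(action_values(g=2, d=1, gamma_tilde=0.0, n_found=1))  # (1.0, 2.0)
print(action_values(g=2, d=1, gamma_tilde=0.0, n_found=2))  # (3.0, 2.0)
```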

---

### 3 Soft‑max choice with bias and caching

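For each planning outcome $n$, effective action values combine the model‑based values $\mathbf V^{(n)}_{t,d,s}$ from (2), weighted by $w_{\text{MB}}$, with a fixed bias $\beta_{\text{bias}}$ on the first action and a cached habitual tendency $\mathbf C_{t,d,s}$: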

$$
Q^{(n)}_{t,d,s} =
w_{\text{MB}}\,\mathbf V^{(n)}_{t,d,s}
+ [\beta_{\text{bias}},\,0]
+ \beta_{\text{cache}}\,\mathbf C_{t,d,s}. \tag{3}
$$

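After trial outcome $o_t\in\{-1,+1\}$, the cache moves toward the one‑hot vector $\mathbf e_{a_{t,d}}$ of the realised meta‑action $a_{t,d}$, with a learning rate that depends on the outcome and on which action was taken: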

$$
\alpha_t =
\sigma\!\bigl(\tilde\alpha + o_t\,[\beta_{\text{rel.}},\,\beta_{\text{ctrl}}]_{a_{t,d}}\bigr), \qquad
\mathbf C_{t+1,d,s} =
\mathbf C_{t,d,s} + \alpha_t\bigl(\mathbf e_{a_{t,d}} - \mathbf C_{t,d,s}\bigr). \tag{4}
$$
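A single step of (4) might look like the sketch below. The action indexing (0 = relinquish, 1 = control) is inferred from the order of $[\beta_{\text{rel.}},\,\beta_{\text{ctrl}}]$ and should be read as an assumption, as should the initial cache of `[0.5, 0.5]` used in the example. The function name `update_cache` is hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic transform that maps the learning rate into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def update_cache(cache, action, outcome, alpha_tilde, beta_rel, beta_ctrl):
    """Apply (4) once.

    cache   -- [C_relinquish, C_control] (assumed ordering)
    action  -- realised meta-action index, 0 or 1
    outcome -- trial outcome o_t, either -1 or +1
    """
    beta = (beta_rel, beta_ctrl)[action]  # outcome sensitivity of the taken action
    alpha = sigmoid(alpha_tilde + outcome * beta)
    one_hot = [1.0 if i == action else 0.0 for i in range(len(cache))]
    # Delta-rule step toward the one-hot vector of the realised action.
    return [c + alpha * (e - c) for c, e in zip(cache, one_hot)]


# With alpha_tilde = 0 and beta_ctrl = 0 the learning rate is 0.5.
new_cache = update_cache([0.5, 0.5], action=1, outcome=+1,
                         alpha_tilde=0.0, beta_rel=0.0, beta_ctrl=0.0)
print(new_cache)  # [0.25, 0.75]
```

Because the target is one‑hot, the entry of the realised action rises and the other entry decays by the same amount, so the components continue to sum to one whenever they start that way.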

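A soft‑max over the two components of $Q^{(n)}_{t,d,s}$ yields the policy

$$
\pi^{(n)}_{t,d,s}(a) =
\frac{\exp Q^{(n)}_{t,d,s}(a)}{\sum_{a'}\exp Q^{(n)}_{t,d,s}(a')}. \tag{5}
$$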
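Combining (3) with the soft‑max step can be sketched as below. The helper names `softmax` and `policy` are hypothetical, the two‑element vectors follow the assumed ordering (relinquish, control), and the numerical inputs are illustrative rather than fitted values.

```python
import math


def softmax(q):
    """Numerically stable soft-max over a list of action values."""
    m = max(q)
    exps = [math.exp(x - m) for x in q]
    z = sum(exps)
    return [e / z for e in exps]


def policy(values, cache, w_mb, beta_bias, beta_cache):
    """Compute Q as in (3) and return the soft-max policy over the two actions."""
    bias = (beta_bias, 0.0)  # the bias applies to the first action only
    q = [w_mb * v + b + beta_cache * c for v, b, c in zip(values, bias, cache)]
    return softmax(q)


# Here the bias of 1 on the first action exactly offsets its lower value (1 vs 2).
pi = policy(values=(1.0, 2.0), cache=(0.5, 0.5),
            w_mb=1.0, beta_bias=1.0, beta_cache=0.0)
print(pi)  # [0.5, 0.5]
```

Subtracting the maximum before exponentiating leaves the probabilities unchanged and avoids overflow for large $Q$.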

---

### 4 Mixture likelihood

Because the planning outcome $n\in\{0,1,2\}$ is latent, the probability of the observed meta‑action marginalises over $n$:

$$
P\!\bigl(a_{t,d}\mid\theta\bigr) \;=\;
\sum_{n=0}^{2} p^{(n)}_{t,d,s}\;
\pi^{(n)}_{t,d,s}\!\bigl(a_{t,d}\bigr). \tag{6}
$$

Summing the log of (6) over all meta‑decisions yields the total log‑likelihood

$$
\log\mathcal L(\theta) \;=\;
\sum_{t,d}\;
\log\Bigl[
\sum_{n}
p^{(n)}_{t,d,s}\;
\pi^{(n)}_{t,d,s}\!\bigl(a_{t,d}\bigr)
\Bigr]. \tag{7}
$$
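The sum in (7) can be sketched as follows. The input format, a list of `(weights, policies, action)` triples, is an assumption chosen for illustration and is not the data layout of the referenced implementation. The sketch takes the mixture weights as given and does not check that they sum to one.

```python
import math


def mixture_log_likelihood(decisions):
    """Total log-likelihood over meta-decisions, as in (7).

    Each element of `decisions` is (weights, policies, action), where
    weights[n] is p^(n), policies[n] is the policy vector under planning
    outcome n, and action is the index of the observed meta-action.
    """
    total = 0.0
    for weights, policies, action in decisions:
        # Marginalise the latent planning outcome n, as in (6).
        likelihood = sum(w * pi[action] for w, pi in zip(weights, policies))
        total += math.log(likelihood)
    return total


# Illustrative numbers only: one decision, observed action index 1.
example = [((0.5, 0.5, 0.0), ([0.5, 0.5], [0.2, 0.8], [0.1, 0.9]), 1)]
ll = mixture_log_likelihood(example)
print(round(ll, 4))  # log(0.5*0.5 + 0.5*0.8) = log(0.65)
```

The logarithm is taken after the sum over $n$, so the latent outcome is integrated out per decision rather than per trial sequence.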

Parameters $\theta=\{\tilde b,\tilde b_2,\tilde\gamma,\tilde w,
\beta_{\text{bias}},\beta_{\text{cache}},
\tilde\alpha,\beta_{\text{rel.}},\beta_{\text{ctrl}}\}$
are estimated by maximising (7) or via hierarchical Bayesian inference.
Equations (1a)–(7) reproduce exactly the implementation
`ChoiceBias_MB_Breadth_Depth_actionSeparation_Cached`.