# Efficient Implementation of Multi-Controlled Quantum Gates

## Ben Zindorf and Sougato Bose

Department of Physics and Astronomy, University College London, Gower Street, WC1E 6BT London, United Kingdom

April 30, 2024

#### Abstract

We present an implementation of multi-controlled quantum gates which provides significant reductions of cost compared to state-of-the-art methods. The operator applied on the target qubit is a unitary, special unitary, or the Pauli X operator (Multi-Controlled Toffoli). The required number of ancilla qubits is no larger than one, similarly to known linear cost decompositions. We extend our methods for any number of target qubits, and provide further cost reductions if additional ancilla qubits are available. For each type of multi-controlled gate, we provide implementations for unrestricted (all-to-all) connectivity and for linear-nearest-neighbor. All of the methods use a linear cost of gates from the Clifford+T (fault-tolerant) set. In the context of linear-nearest-neighbor architecture, the cost and depth of our circuits scale linearly irrespective of the position of the qubits on which the gate is applied. Our methods directly improve the compilation process of many quantum algorithms, providing optimized circuits, which will result in a large reduction of errors.

## 1 Introduction

Multi-controlled single/multiple target quantum gates are pivotal ingredients for quantum computation. For example, a constructive method for implementing arbitrary unitaries, which is, in fact, used for the proof of the sufficiency of fundamental gate sets for universal quantum computation, uses such gates [1,2]. Of course, that constructive method may not be the most efficient, but nonetheless emphasizes that such gates are at the heart of quantum computation.

Such gates are also very useful in circuits for quantum chemistry, e.g. for particle-number-conserving Hamiltonians [3], for constructing a circuit implementing a quantum neuron [4], in quantum circuits for isometries [5,6], for constructing a quantum RAM [7], in the circuit for Grover's quantum search algorithm [8], and, more generally, for quantum arithmetic, such as the arithmetic part of Shor's algorithm [9].

The general importance of such gates has triggered several works on implementing multi-controlled unitaries. For example, even their approximate realizations have been of interest [10]. In terms of exact realizations, there are also several results, such as [5,11,12] (see Table 1) and [13–15]. With respect to these exact realizations, we find and report here a significant reduction in the number of elementary fault-tolerant gates (CNOT, T and H count). This, in turn, will help reduce the resources for fault-tolerant circuits, while also helping to reduce the accumulated error in NISQ circuits. We present results with minimal (non-growing) ancillae for both all-to-all connectivity and, motivated by hardware, a linear-nearest-neighbor architecture.

### 1.1 Notations

In this paper we use the notations which were defined in [16]. We list some of these notations here for convenience.

Any SU(2) operator can be represented as a rotation by angle  $\lambda$  about an axis  $\hat{v}$  as  $R_{\hat{v}}(\lambda)$  [1]. We use the notation  $\Pi(\hat{v}) := iR_{\hat{v}}(\pi)$  to describe a Hermitian  $\pi$ -rotation gate. The three dimensional rotation matrix by angle  $\theta$  about the vector  $\hat{v}$  is marked as  $\hat{R}_{\hat{v}}(\theta)$ .

These notations have been used to describe the decomposition of any SU(2) operator in terms of two  $\Pi$  gates in [16] Lemma 1 as follows.

**Lemma 1.** Any  $R_{\hat{v}}(\lambda) \in SU(2)$  operator can be implemented as  $\Pi(\hat{v}_2)\Pi(\hat{v}_1)$  such that  $\hat{v}_1$  can be chosen as any unit vector perpendicular to  $\hat{v}$ , and  $\hat{v}_2 = \hat{R}_{\hat{v}}(\frac{\lambda}{2})\hat{v}_1$  with  $\frac{\lambda}{2} \in (-\pi, \pi]$ .

This lemma can be seen as a generalisation of known identities of the Pauli matrices  $X = \Pi(\hat{x}), Y = \Pi(\hat{y})$  and  $Z = \Pi(\hat{z})$ . For example, since the unit vector  $\hat{y}$  can be obtained by rotating  $\hat{x}$  about  $\hat{z}$ , as  $\hat{y} = \hat{R}_{\hat{z}}(\frac{\pi}{2})\hat{x}$ , and  $\hat{x} \perp \hat{z}$ , we get  $\Pi(\hat{y})\Pi(\hat{x}) = R_{\hat{z}}(\pi)$ , i.e. YX = -iZ.
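As a quick numerical sanity check of Lemma 1 (not part of the original derivation), the following sketch builds $R_{\hat{v}}(\lambda)$ and $\Pi(\hat{v})$ as $2 \times 2$ matrices with NumPy. It assumes the standard convention $R_{\hat{v}}(\lambda) = \cos(\frac{\lambda}{2}) I - i \sin(\frac{\lambda}{2})\, \hat{v} \cdot \vec{\sigma}$; the helper names `rot`, `pi_gate` and `rot3`, and the test values, are illustrative choices.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def rot(v, lam):
    """SU(2) rotation R_v(lam) = cos(lam/2) I - i sin(lam/2) v.sigma (assumed convention)."""
    v = np.asarray(v, dtype=float)
    v = v / np.linalg.norm(v)
    return np.cos(lam / 2) * I2 - 1j * np.sin(lam / 2) * (v[0] * X + v[1] * Y + v[2] * Z)

def pi_gate(v):
    """Hermitian pi-rotation Pi(v) := i R_v(pi), which equals v.sigma."""
    return 1j * rot(v, np.pi)

def rot3(axis, theta):
    """3D rotation matrix by angle theta about `axis` (Rodrigues' formula)."""
    k = np.asarray(axis, dtype=float)
    k = k / np.linalg.norm(k)
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

# Arbitrary test values, chosen only for illustration.
v = np.array([1.0, 2.0, 2.0]) / 3.0
lam = 0.7
v1 = np.cross(v, [1.0, 0.0, 0.0])      # any unit vector perpendicular to v
v1 = v1 / np.linalg.norm(v1)
v2 = rot3(v, lam / 2) @ v1             # v2 = R_v(lam/2) v1

lemma1_holds = np.allclose(pi_gate(v2) @ pi_gate(v1), rot(v, lam))
pauli_case_holds = np.allclose(Y @ X, -1j * Z)   # the YX = -iZ example
print(lemma1_holds, pauli_case_holds)
```

The Pauli example from the text corresponds to $\hat{v} = \hat{z}$, $\hat{v}_1 = \hat{x}$ and $\lambda = \pi$.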

When referring to an operator O applied on a set of qubits  $\tau = \{t_1, t_2, ..., t_m\}$ , we use the notation  $[O]_{\tau}$ . If the operator is controlled by a qubit set  $C = \{c_1, c_2, ..., c_n\}$  and applied on a qubit t, we write  $[O]_t^{\{c_1, c_2, ..., c_n\}}$  or  $[O]_t^C$ , and use one of the following gate notations:

[Circuit diagrams: the gate notations used for an operator $O$ applied on the target qubit $t$ and controlled by the qubits $c_1, c_2, ..., c_n$.]

[16, Lemma 4 and Lemma 13] provide methods for transforming the rotation axis of multi-controlled gates. We restate them here for convenience.

**Lemma 2.**  $[e^{i\psi}R_{\hat{v}}(\lambda)]_t^C = [\Pi(\hat{v}_M)]_t[e^{i\psi}R_{\hat{v}'}(\lambda)]_t^C[\Pi(\hat{v}_M)]_t$  for any angles  $\lambda, \psi$ , unit vectors  $\hat{v}, \hat{v}'$ , and  $\hat{v}_M \in M(\hat{v}, \hat{v}')$ .

**Lemma 3.** If  $\hat{v} = \hat{R}_{\hat{\sigma}}(\phi)\hat{\tau}$  for any unit vectors  $\hat{\tau}, \hat{\sigma}$  satisfying  $\hat{\tau} \perp \hat{\sigma}$ , then  $[\Pi(\hat{v})]_t^C = [R_{\hat{\sigma}}(\phi)]_t [\Pi(\hat{\tau})]_t^C [R_{\hat{\sigma}}(-\phi)]_t$ .

The set  $M(\hat{v}_1, \hat{v}_2)$  is defined for any two unit vectors  $\hat{v}_1, \hat{v}_2$  such that  $\hat{v}_M \in M(\hat{v}_1, \hat{v}_2)$  iff  $\hat{v}_2 = \hat{R}_{\hat{v}_M}(\pi)\hat{v}_1$ . In other words,  $\hat{v}_M$  can be chosen as a unit vector that is in the "middle" between  $\hat{v}_1$  and  $\hat{v}_2$ . A well-known special case of Lemma 2 is achieved by noting that  $H = \Pi(\hat{h})$ , with  $\hat{h} \in M(\hat{x}, \hat{z})$ .
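The axis transformation of Lemma 2 can be checked numerically for a single control qubit. The sketch below is an illustration under the assumed convention $R_{\hat{v}}(\lambda) = \cos(\frac{\lambda}{2}) I - i \sin(\frac{\lambda}{2})\, \hat{v} \cdot \vec{\sigma}$; it picks $\hat{v}_M$ as the normalized midpoint of $\hat{v}$ and $\hat{v}'$, and also confirms $H = \Pi(\hat{h})$. The helper `controlled_1` and the numerical values are hypothetical.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def rot(v, lam):
    """SU(2) rotation R_v(lam) under the assumed convention."""
    v = np.asarray(v, dtype=float)
    v = v / np.linalg.norm(v)
    return np.cos(lam / 2) * I2 - 1j * np.sin(lam / 2) * (v[0] * X + v[1] * Y + v[2] * Z)

def pi_gate(v):
    """Pi(v) := i R_v(pi)."""
    return 1j * rot(v, np.pi)

def controlled_1(U):
    """Singly controlled U; the control is the first (most significant) qubit."""
    P0 = np.diag([1, 0]).astype(complex)
    P1 = np.diag([0, 1]).astype(complex)
    return np.kron(P0, I2) + np.kron(P1, U)

# Illustrative axes and angles.
v = np.array([1.0, 2.0, 2.0]) / 3.0
v_prime = np.array([1.0, 0.0, 0.0])
v_mid = (v + v_prime) / np.linalg.norm(v + v_prime)   # an element of M(v, v')
lam, psi = 0.9, 0.3

lhs = controlled_1(np.exp(1j * psi) * rot(v, lam))
conj = np.kron(I2, pi_gate(v_mid))                    # Pi(v_M) on the target only
rhs = conj @ controlled_1(np.exp(1j * psi) * rot(v_prime, lam)) @ conj
lemma2_holds = np.allclose(lhs, rhs)

# Special case: the Hadamard gate is Pi(h) with h in the middle between x and z.
h = np.array([1.0, 0.0, 1.0]) / np.sqrt(2)
H = (X + Z) / np.sqrt(2)
hadamard_is_pi = np.allclose(pi_gate(h), H)
print(lemma2_holds, hadamard_is_pi)
```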

The phase gate is defined as:

$$P(\lambda) := e^{i\frac{\lambda}{2}} R_{\hat{z}}(\lambda) = \begin{pmatrix} 1 & 0 \\ 0 & e^{i\lambda} \end{pmatrix}$$

We use the standard notations for two specific phase gates:  $S := P(\frac{\pi}{2}) = \begin{pmatrix} 1 & 0 \\ 0 & i \end{pmatrix}$ , and  $T := P(\frac{\pi}{4})$ .

The notation  $C[a:b] := \{c_a, c_{a+1}, ..., c_b\}$  is used to define a subset, and  $C[a:a] = \{c_a\}$  is written as C[a]. When concatenating two sets of qubits, we simply use  $\{C, \tau\} := \{c_1...c_n, t_1...t_m\}$ . For a controlled phase gate, the phase is applied iff all of the control qubits and the target qubit are in the state  $|1\rangle$ , and therefore it is not required to specify the target. For example, the gate  $[Z]_t^C$  can also be written as  $[Z]^{\{C,t\}}$ . We refer to any gate which can be decomposed using only multi-controlled (or single-qubit) phase gates as a relative phase gate, usually marked as  $[\Delta]$ . Finally, a multi-controlled multi-target gate  $\prod_{i=1}^m [O_i]_{t_i}^C$  is marked as follows.

[Circuit diagram: the control set $C$ (a bundle of $n$ qubits) controlling the operators $O_1, O_2, ..., O_m$ applied on the targets $t_1, t_2, ..., t_m$.]

## 2 All-to-all (ATA) connectivity

### 2.1 Multi-controlled SU(2)

In this section, we focus on a cost-efficient implementation of multi-controlled SU(2) gates (MCSU2) with a single or multiple targets and no ancilla qubit.

#### 2.1.1 MCSU2 macro structure

We provide a decomposition of any multi-controlled gate  $[W]_t^C$  such that  $W \in SU(2)$ . Any such W can be expressed as a rotation about an axis  $\hat{v}$  by an angle  $\lambda$  as  $W = R_{\hat{v}}(\lambda)$ . The decomposition requires eight single-qubit gates from  $\{R_{\hat{x}}, R_{\hat{z}}\}$ , in addition to four large relative phase gates.

As a first step we provide the following decomposition.

[Circuit diagram: decomposition of $[W]_t^C$ with the controls split into the subsets $C_1$ (of size $n_1$) and $C_2$ (of size $n_2$), as stated in Lemma 4.]

**Lemma 4.** Any  $[R_{\hat{v}}(\lambda)]_t^C$  gate with  $R_{\hat{v}}(\lambda) \in SU(2)$  can be implemented as  $[\Pi(\hat{v}_2)]_t^{C_2}[\Pi(\hat{v}_1)]_t^{C_1}[\Pi(\hat{v}_2)]_t^{C_2}[\Pi(\hat{v}_1)]_t^{C_1}$ , such that  $C_1 \cup C_2 = C$ .  $\hat{v}_1$  can be chosen as any unit vector perpendicular to  $\hat{v}$ , and  $\hat{v}_2 = \hat{R}_{\hat{v}}(\frac{\lambda}{4})\hat{v}_1$ .

*Proof.* We define the pair  $(g_1, g_2)$ , where  $g_j = 1$  (for  $j \in \{1,2\}$ ) denotes that  $C_j$  is in the computational basis state  $|11..1\rangle$ , and  $g_j = 0$  otherwise. The operator applied on the target qubit for each value of the pair  $(g_1, g_2)$  is:

$$(00): I, (01): (\Pi(\hat{v}_2))^2 = I, (10): (\Pi(\hat{v}_1))^2 = I, (11): (\Pi(\hat{v}_2)\Pi(\hat{v}_1))^2$$

Since  $\hat{v}_1 \perp \hat{v}$  and  $\hat{v}_2 = \hat{R}_{\hat{v}}(\frac{\lambda}{4})\hat{v}_1$ , according to Lemma 1,  $\Pi(\hat{v}_2)\Pi(\hat{v}_1) = R_{\hat{v}}(\frac{\lambda}{2})$ , and the operator applied on the target qubit for option (11) is  $(R_{\hat{v}}(\frac{\lambda}{2}))^2 = R_{\hat{v}}(\lambda)$ .
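Lemma 4 can also be verified numerically on a small register. The sketch below is an illustration only: it uses three controls split as $C_1 = \{c_1\}$, $C_2 = \{c_2, c_3\}$, builds each controlled gate as a dense matrix, and compares the four-gate product with the directly constructed $[R_{\hat{v}}(\lambda)]_t^C$. The helper `controlled`, the qubit ordering (qubit 0 is the most significant bit) and the test values are assumptions made for this example.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def rot(v, lam):
    """SU(2) rotation R_v(lam) = cos(lam/2) I - i sin(lam/2) v.sigma (assumed convention)."""
    v = np.asarray(v, dtype=float) / np.linalg.norm(v)
    return np.cos(lam / 2) * I2 - 1j * np.sin(lam / 2) * (v[0] * X + v[1] * Y + v[2] * Z)

def pi_gate(v):
    """Pi(v) := i R_v(pi)."""
    return 1j * rot(v, np.pi)

def rot3(axis, theta):
    """3D rotation matrix by angle theta about `axis` (Rodrigues' formula)."""
    k = np.asarray(axis, dtype=float) / np.linalg.norm(axis)
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def controlled(n, controls, target, U):
    """Dense matrix on n qubits applying U on `target` iff all `controls` are |1>."""
    dim = 2 ** n
    M = np.zeros((dim, dim), dtype=complex)
    for col in range(dim):
        bits = [(col >> (n - 1 - q)) & 1 for q in range(n)]
        if all(bits[c] for c in controls):
            for out in (0, 1):
                new_bits = list(bits)
                new_bits[target] = out
                row = sum(b << (n - 1 - q) for q, b in enumerate(new_bits))
                M[row, col] += U[out, bits[target]]
        else:
            M[col, col] = 1
    return M

# Illustrative parameters.
v = np.array([1.0, 2.0, 2.0]) / 3.0
lam = 1.3
v1 = np.cross(v, [1.0, 0.0, 0.0])
v1 = v1 / np.linalg.norm(v1)           # v1 perpendicular to v
v2 = rot3(v, lam / 4) @ v1             # v2 = R_v(lam/4) v1

C1, C2, t = [0], [1, 2], 3
g1 = controlled(4, C1, t, pi_gate(v1))
g2 = controlled(4, C2, t, pi_gate(v2))
lhs = g2 @ g1 @ g2 @ g1
rhs = controlled(4, C1 + C2, t, rot(v, lam))
lemma4_holds = np.allclose(lhs, rhs)
print(lemma4_holds)
```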

To minimize the number of arbitrary single-qubit rotations from  $\{R_{\hat{x}}, R_{\hat{z}}\}$ , it is beneficial to use Lemma 4 in order to implement a multi-controlled SU(2) rotation about the  $\hat{x}$  axis and then transform the axis using Lemma 2. Identity gates in the form of relative phase gates  $[\Delta]_C$ , and their inverse  $[\Delta^{\dagger}]_C$  can be added to produce the following circuit.

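[Circuit (2): implementation of $[W]_t^C$ using the single-qubit gates $A_1, A_2, A_3, A_4$ on the target $t$, interleaved with multi-controlled Z gates controlled by $C_1$ ($n_1$ qubits) and $C_2$ ($n_2$ qubits), together with the relative phase gates $[\Delta_1]_C$, $[\Delta_2]_C$ and their inverses.]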

The additional  $[\Delta]_C$  gates can be set as any relative phase gates. Decomposing each multi-controlled Z (MCZ) gate, combined with its corresponding  $[\Delta]_C$  in terms of basic gates allows for an efficient implementation, as we show in Section 2.1.2.

**Lemma 5.** Any  $[R_{\hat{v}}(\lambda)]_t^C$  gate with  $R_{\hat{v}}(\lambda) \in SU(2)$  can be implemented with eight gates from  $\{R_{\hat{x}}, R_{\hat{z}}\}$ , two  $[Z]_t^{C_1}$  and two  $[Z]_t^{C_2}$  gates as Circuit (2) such that  $A_2 = R_{\hat{x}}(-\frac{\lambda}{4})$ ,  $A_3 = R_{\hat{x}}(\frac{\lambda}{4})$ ,  $A_4 = R_{\hat{z}}(\theta_2)R_{\hat{x}}(\theta_1)$ ,  $A_1 = A_4^{\dagger}A_3$ , and  $C_1 \cup C_2 = C$ . The  $[\Delta]_C$  gates apply a relative phase, and the angles  $\theta_1, \theta_2$  are defined by the axis  $\hat{v}$ .

*Proof.* Using Lemma 2 with  $\psi = 0$  and  $\hat{v}_M \in M(\hat{v}, \hat{x})$ , we can write

$$[R_{\hat{v}}(\lambda)]_t^C = [\Pi(\hat{v}_M)]_t [R_{\hat{x}}(\lambda)]_t^C [\Pi(\hat{v}_M)]_t.$$

The gate  $\Pi(\hat{v}_M)$  can be decomposed up to a global phase using the XZX Euler angles as  $R_{\hat{x}}(\theta_3)R_{\hat{z}}(\theta_2)R_{\hat{x}}(\theta_1) = R_{\hat{x}}(\theta_3)A_4$ , and since  $\Pi(\hat{v}_M)$  is Hermitian, it can be implemented as  $A_4^{\dagger}R_{\hat{x}}(-\theta_3)$  as well. Substituting, and cancelling the  $R_{\hat{x}}(\pm\theta_3)$  gates (which commute with  $[R_{\hat{x}}(\lambda)]_t^C$ ), provides

$$[R_{\hat{v}}(\lambda)]_t^C = [A_4^{\dagger}]_t [R_{\hat{x}}(\lambda)]_t^C [A_4]_t$$

From Lemma 4 with  $\hat{v}_1 = \hat{z}$  and  $\hat{v}_2 = \hat{R}_{\hat{x}}(\frac{\lambda}{4})\hat{z}$  we get

$$[R_{\hat{x}}(\lambda)]_t^C = [\Pi(\hat{v}_2)]_t^{C_2} [Z]_t^{C_1} [\Pi(\hat{v}_2)]_t^{C_2} [Z]_t^{C_1}$$

Finally, each  $[\Pi(\hat{v}_2)]_t^{C_2}$  gate can be implemented as

$$[\Pi(\hat{v}_2)]_t^{C_2} = [R_{\hat{x}}(\frac{\lambda}{4})]_t [Z]_t^{C_2} [R_{\hat{x}}(-\frac{\lambda}{4})]_t = [A_3]_t [Z]_t^{C_2} [A_2]_t$$

according to Lemma 3. Since relative phase gates commute with each other and with  $[Z]_t^{C_1}$ ,  $[Z]_t^{C_2}$ , the effects of the additional  $[\Delta]_C$  gates cancel out.
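The single-qubit bookkeeping in this proof can be sketched numerically by tracking only the operator seen by the target for each control configuration. The gate order used below is read off from the three equations of the proof and is an assumption about the layout of Circuit (2). For convenience the sketch fixes $\theta_1, \theta_2$ and derives the axis $\hat{v}$ from $A_4$ (the lemma goes the other way); all names and values are illustrative, and the convention $R_{\hat{v}}(\lambda) = \cos(\frac{\lambda}{2}) I - i \sin(\frac{\lambda}{2})\, \hat{v} \cdot \vec{\sigma}$ is assumed.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def rot(v, lam):
    """SU(2) rotation R_v(lam) under the assumed convention."""
    v = np.asarray(v, dtype=float) / np.linalg.norm(v)
    return np.cos(lam / 2) * I2 - 1j * np.sin(lam / 2) * (v[0] * X + v[1] * Y + v[2] * Z)

def rx(angle):
    return rot([1.0, 0.0, 0.0], angle)

def rz(angle):
    return rot([0.0, 0.0, 1.0], angle)

# Illustrative angles.
theta1, theta2, lam = 0.4, 1.1, 0.9

A2 = rx(-lam / 4)
A3 = rx(lam / 4)
A4 = rz(theta2) @ rx(theta1)
A1 = A4.conj().T @ A3

# Axis v implied by A4: v.sigma = A4^dagger X A4.
v_sigma = A4.conj().T @ X @ A4
v = np.array([np.trace(P @ v_sigma).real / 2 for P in (X, Y, Z)])

def target_operator(g1, g2):
    """Operator on the target when C1 (g1) and C2 (g2) are / are not all |1>.

    Matrix products apply the rightmost factor first: A4, Z^{C1}, A2, Z^{C2},
    A3, Z^{C1}, A2, Z^{C2}, A1.
    """
    z1 = Z if g1 else I2
    z2 = Z if g2 else I2
    return A1 @ z2 @ A2 @ z1 @ A3 @ z2 @ A2 @ z1 @ A4

results = {(g1, g2): target_operator(g1, g2) for g1 in (0, 1) for g2 in (0, 1)}
for key, op in results.items():
    expected = rot(v, lam) if key == (1, 1) else I2
    print(key, np.allclose(op, expected))
```

The relative phase gates $[\Delta]_C$ are omitted here because they act only on the controls and cancel in pairs.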

#### 2.1.2 MCSU2 decomposition

In this section, we provide a decomposition of  $[\Delta_1]_C[Z]_t^{C_1}$ ,  $[\Delta_2]_C[Z]_t^{C_2}$  and their inverse in terms of Clifford+T gates, as required for the implementation described in Lemma 5. The following lemma will be used to develop our structure.

**Lemma 6.** For two sets of qubits A, B and a qubit  $d \notin A \cup B$ , the following holds  $[\Pi(\hat{v}_2)]_d^B [\Pi(\hat{v}_1)]_d^A [\Pi(\hat{v}_2)]_d^B = [Z]^{\{A,B\}} [\Pi(\hat{v}_1)]_d^A$  if the unit vectors  $\hat{v}_1$  and  $\hat{v}_2$  are perpendicular.

*Proof.* For any choice of two perpendicular vectors  $\hat{v}_1 \perp \hat{v}_2$ , a unit vector  $\hat{v}$  perpendicular to both can be defined as  $\hat{v} = \hat{v}_1 \times \hat{v}_2$  such that  $\hat{v}_2 = \hat{R}_{\hat{v}}(\frac{\pi}{2})\hat{v}_1$ . Therefore, according to Lemma 4,

$$[\Pi(\hat{v}_2)]_d^B[\Pi(\hat{v}_1)]_d^A[\Pi(\hat{v}_2)]_d^B[\Pi(\hat{v}_1)]_d^A = [R_{\hat{v}}(2\pi)]_d^{\{A,B\}}$$

We note that for any unit vector  $\hat{v}$ ,  $[R_{\hat{v}}(2\pi)]_d^{\{A,B\}} = [-I]_d^{\{A,B\}}$ . This operator merely applies a phase of -1 if all qubits in set  $A \cup B$  are in the computational basis state  $|1\rangle$ , and does not change the state of the target qubit d. Therefore, it is equivalent to applying a multi-controlled Z gate on set  $A \cup B$ , and we can write  $[-I]_d^{\{A,B\}} = [Z]^{\{A,B\}}$  as mentioned in [14]. By substituting this into the equation above and applying the Hermitian gate  $[\Pi(\hat{v}_1)]_d^A$  on the right-hand side, we get the required equation.

For our construction, we use Lemma 6 with  $\hat{v}_1 = \hat{z}, \hat{v}_2 = \hat{x}$ , and set  $B = \{c, d'\}$ , which results in  $[X]_d^{\{c,d'\}}[Z]_d^A[X]_d^{\{c,d'\}} = [Z]_{d'}^{\{A,c\}}[Z]_d^A$  as follows.
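This instance of Lemma 6 can be checked by brute force on five qubits. The following sketch is an illustration with $A = \{a_1, a_2\}$; the helper `controlled` and the qubit indices are hypothetical choices made for the example.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def controlled(n, controls, target, U):
    """Dense matrix on n qubits applying U on `target` iff all `controls` are |1>."""
    dim = 2 ** n
    M = np.zeros((dim, dim), dtype=complex)
    for col in range(dim):
        bits = [(col >> (n - 1 - q)) & 1 for q in range(n)]
        if all(bits[c] for c in controls):
            for out in (0, 1):
                new_bits = list(bits)
                new_bits[target] = out
                row = sum(b << (n - 1 - q) for q, b in enumerate(new_bits))
                M[row, col] += U[out, bits[target]]
        else:
            M[col, col] = 1
    return M

# Qubit indices (illustrative): A = {a1, a2}, then c, d', d.
n = 5
A = [0, 1]
c, d_prime, d = 2, 3, 4

toffoli = controlled(n, [c, d_prime], d, X)     # [X]_d^{c, d'}
mcz_A = controlled(n, A, d, Z)                  # [Z]_d^A

lhs = toffoli @ mcz_A @ toffoli
rhs = controlled(n, A + [c], d_prime, Z) @ mcz_A   # [Z]_{d'}^{A, c} [Z]_d^A
lemma6_instance_holds = np.allclose(lhs, rhs)
print(lemma6_instance_holds)
```

Conjugating by the Toffoli gate therefore extends the control set of the MCZ gate by one qubit, at the price of leaving the smaller MCZ gate behind.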

We introduce the following notation:

$$\{Z\}_{C,D}^m = \prod_{j=2}^m [Z]_{d_{j-1}}^{C[1:j]}$$

where  $C=\{c_1,c_2,..,c_n\}$  and  $D=\{d_1,d_2,..,d_{n-1}\}$  are two qubit sets, and m can be chosen such that  $2 \leq m \leq n$ . In addition, we use the  $[X_{\Delta}]_{q_3}^{\{q_1,q_2\}}$  notation for a Hermitian relative phase Toffoli gate which satisfies

This gate notation allows us to distinguish between the control qubits. We use a symbol which combines the  $|0\rangle$ -controlled and  $|1\rangle$ -controlled symbols, noting that for each of these basis states of  $q_1$ , a different controlled operator is effectively applied on  $q_2$  and  $q_3$ .

We now show that two  $[X_{\Delta}]$  gates can be used to transform  $\{Z\}_{C,D}^{m-1}$  to  $\{Z\}_{C,D}^{m}$ .

**Lemma 7.** 
$$\{Z\}_{C,D}^m = [X_{\Delta}]_{d_{m-2}}^{\{c_m,d_{m-1}\}} \{Z\}_{C,D}^{m-1} [X_{\Delta}]_{d_{m-2}}^{\{c_m,d_{m-1}\}} \text{ for } 3 \leq m \leq |C|, \text{ and } |D| \geq m-1.$$

*Proof.* We prove for the following equivalent equation, in which  $[X_{\Delta}]$  are replaced with Toffoli gates:

$$\{Z\}_{C,D}^m = [X]_{d_{m-2}}^{\{c_m,d_{m-1}\}} \{Z\}_{C,D}^{m-1} [X]_{d_{m-2}}^{\{c_m,d_{m-1}\}}$$

where the equivalence can be easily verified by substituting each  $[X_{\Delta}]$  gate with its definition which places all of the relative phase gates at the center, then these are commuted and cancelled out. For m=3:

$$[X]_{d_1}^{\{c_3,d_2\}}\{Z\}_{C,D}^2[X]_{d_1}^{\{c_3,d_2\}} = [X]_{d_1}^{\{c_3,d_2\}}[Z]_{d_1}^{\{c_1,c_2\}}[X]_{d_1}^{\{c_3,d_2\}} \stackrel{*}{=} [Z]_{d_1}^{\{c_1,c_2\}}[Z]_{d_2}^{\{c_1,c_2,c_3\}} = \{Z\}_{C,D}^3$$

where \* is according to Lemma 6.

For  $m \geq 4$ , by definition:

$${\{Z\}_{C,D}^{m-1} = [Z]_{d_{m-2}}^{C[1:m-1]} \{Z\}_{C,D}^{m-2}}$$

Since  $c_m, d_{m-1}, d_{m-2}$  are not accessed in  $\{Z\}_{C,D}^{m-2}$ , this gate commutes with  $[X]_{d_{m-2}}^{\{c_m, d_{m-1}\}}$ . Therefore:

$$[X]_{d_{m-2}}^{\{c_m,d_{m-1}\}}\{Z\}_{C,D}^{m-1}[X]_{d_{m-2}}^{\{c_m,d_{m-1}\}} = [X]_{d_{m-2}}^{\{c_m,d_{m-1}\}}[Z]_{d_{m-2}}^{C[1:m-1]}[X]_{d_{m-2}}^{\{c_m,d_{m-1}\}}\{Z\}_{C,D}^{m-2}$$

According to Lemma 6:

$$[X]_{d_{m-2}}^{\{c_m,d_{m-1}\}}[Z]_{d_{m-2}}^{C[1:m-1]}[X]_{d_{m-2}}^{\{c_m,d_{m-1}\}} = [Z]_{d_{m-1}}^{C[1:m]}[Z]_{d_{m-2}}^{C[1:m-1]}$$

where we used  $C[1:m-1] \cup \{c_m\} = C[1:m]$ . Finally, we can substitute and use

$$[Z]_{d_{m-1}}^{C[1:m]}[Z]_{d_{m-2}}^{C[1:m-1]}\{Z\}_{C,D}^{m-2}=\{Z\}_{C,D}^{m}$$
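The recursion can be exercised numerically. The sketch below is an illustration for $n = 4$ (seven qubits) that uses plain Toffoli gates in place of $[X_{\Delta}]$, as in the proof, and applies the conjugation step by step, checking the result against the definition of $\{Z\}_{C,D}^m$ after every step. The helper names and qubit indices are assumptions of this example.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def controlled(n, controls, target, U):
    """Dense matrix on n qubits applying U on `target` iff all `controls` are |1>."""
    dim = 2 ** n
    M = np.zeros((dim, dim), dtype=complex)
    for col in range(dim):
        bits = [(col >> (n - 1 - q)) & 1 for q in range(n)]
        if all(bits[c] for c in controls):
            for out in (0, 1):
                new_bits = list(bits)
                new_bits[target] = out
                row = sum(b << (n - 1 - q) for q, b in enumerate(new_bits))
                M[row, col] += U[out, bits[target]]
        else:
            M[col, col] = 1
    return M

n = 4
C = [0, 1, 2, 3]        # c_1 .. c_4
D = [4, 5, 6]           # d_1 .. d_3
N = len(C) + len(D)

def z_ladder(m):
    """{Z}^m_{C,D}: product over j = 2..m of [Z]_{d_{j-1}}^{C[1:j]}."""
    M = np.eye(2 ** N, dtype=complex)
    for j in range(2, m + 1):
        M = controlled(N, C[:j], D[j - 2], Z) @ M
    return M

current = z_ladder(2)   # a single [Z]_{d_1}^{c_1, c_2}
step_checks = []
for m in range(3, n + 1):
    # Toffoli [X]_{d_{m-2}}^{c_m, d_{m-1}} with 1-based labels mapped to 0-based lists.
    toffoli = controlled(N, [C[m - 1], D[m - 2]], D[m - 3], X)
    current = toffoli @ current @ toffoli
    step_checks.append(np.allclose(current, z_ladder(m)))

print(step_checks)
```

Each step costs two Toffoli-type gates, which matches the count of $2n-4$ such gates for the full ladder.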

The following lemma is simply achieved by repeatedly applying the recursive formula presented in Lemma 7, and as a final step, applying  $[S]_{c_2}^{c_1}$  to both sides. This added gate commutes with any other gate in the circuit and transforms  $[Z]_{d_1}^{\{c_1,c_2\}}$  to  $[iZ]_{d_1}^{\{c_1,c_2\}}$ :

**Lemma 8.** 
$$\{Z\}_{C,D}^n[S]_{c_2}^{c_1} = (\prod_{j=3}^n [X_{\Delta}]_{d_{j-2}}^{\{c_j,d_{j-1}\}})^{\dagger}[iZ]_{d_1}^{\{c_1,c_2\}}(\prod_{j=3}^n [X_{\Delta}]_{d_{j-2}}^{\{c_j,d_{j-1}\}}) \text{ for } n \geq 3, \ |C| = n, \ \text{and } |D| = n-1.$$

The following circuit describes the decomposition of  $\{Z\}_{C,D}^5[S]_{c_2}^{c_1}$  in terms of  $[X_{\Delta}]_{q_3}^{\{q_1,q_2\}}$  and  $[iZ]_{q_3}^{\{q_1,q_2\}}$  gates, or  $[X_{\Delta}]$  and [iZ] for short:




This structure requires 2n-4  $[X_{\Delta}]$  gates and one [iZ] gate. We use the following decompositions for  $[X_{\Delta}]_{q_3}^{\{q_1,q_2\}}$  [17], and  $[iZ]_{q_3}^{\{q_1,q_2\}}$ .

Each  $[X_{\Delta}]$  requires two H, three CNOT and four T gates. It can be noted that when applied as part of Lemma 8, the gates to the left of the barrier in Circuit (6) can commute with any gate which is executed before. This, along with additional similar commutations, allows for depth reductions, as shown in Appendix A. A simple summation provides the following lemma.

**Lemma 9.** A  $\{Z\}_{C,D}^n[S]_{c_2}^{c_1}$  gate with  $n \ge 3$ , |C| = n, and |D| = n - 1 can be implemented with CNOT cost 6n - 8 and depth 4n - 2, T cost 8n - 12 and depth 4n, H cost 4n - 8 and depth 2n - 2.

We wish to use this to decompose the structure described in Lemma 5 which implements any  $[R_{\hat{v}}(\lambda)]_t^C$  (MCSU2) gate. We choose  $[\Delta_1]_C = [S]^{C_1[1:2]} \{Z\}_{C_1,C_2'}^{n_1-1}$  and  $[\Delta_2]_C = [S]^{C_2[1:2]} \{Z\}_{C_2,C_1'}^{n_2-1}$  as relative phase gates applied on the set C of size n. For this choice we get

$$[\Delta_1]_C[Z]_t^{C_1} = [S]^{C_1[1:2]} \{Z\}_{C_1, \{C_2', t\}}^{n_1}, \text{ and } [\Delta_2]_C[Z]_t^{C_2} = [S]^{C_2[1:2]} \{Z\}_{C_2, \{C_1', t\}}^{n_2}$$

The inverted versions of these gates are implemented in a similar way.

The subsets  $C_1, C_2, C_1'$  and  $C_2'$  are of size  $n_1, n_2, n_1'$  and  $n_2'$  respectively, and satisfy  $C_1 \cup C_2 = C, C_1' \subseteq C_1, C_2' \subseteq C_2, n_1' = n_2 - 2, n_2' = n_1 - 2$  and  $n_1 + n_2 = n$ . We can set  $n_1 = \lfloor \frac{n}{2} \rfloor$  and  $n_2 = n - n_1 = \lceil \frac{n}{2} \rceil$ .

These structures can be used in order to decompose Circuit (2). Each cost and depth from Lemma 9 can be written as an + b, and therefore, the counterparts for MCSU2 are equal to  $2(an_1 + b + an_2 + b) = 2an + 4b$ . As shown in Appendix A, the sets  $C'_1, C'_2$  can be chosen such that additional depth reductions are achieved. The following theorem presents the total costs and depths.

**Theorem 1.** Any multi-controlled SU(2) gate with  $n \ge 6$  controls can be implemented without ancilla using eight gates from  $\{R_{\hat{x}}, R_{\hat{z}}\}$  in addition to

CNOT cost  $12n-32$  and depth  $8n-8$ ,

T cost  $16n-48$  and depth  $8n-6$  ( $8n-3$  for odd n),

H cost  $8n-32$  and depth  $4n-11$ .
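The cost part of this bookkeeping can be reproduced in a few lines. The sketch below takes the $(a, b)$ coefficients of the costs in Lemma 9, applies the $2(an_1 + b + an_2 + b)$ rule with $n_1 = \lfloor \frac{n}{2} \rfloor$, and compares the result with the costs stated in Theorem 1. Depths are left out because, for some of them, Theorem 1 includes the further reductions of Appendix A. The dictionary and function names are illustrative.

```python
# Cost coefficients (a, b) of Lemma 9, written as a*n + b.
LEMMA9_COST = {"CNOT": (6, -8), "T": (8, -12), "H": (4, -8)}

# Costs stated in Theorem 1, written as a*n + b.
THEOREM1_COST = {"CNOT": (12, -32), "T": (16, -48), "H": (8, -32)}

def mcsu2_cost(n):
    """Clifford+T cost of an MCSU2 gate with n >= 6 controls, built from four ladders."""
    n1 = n // 2
    n2 = n - n1
    return {
        gate: 2 * ((a * n1 + b) + (a * n2 + b))
        for gate, (a, b) in LEMMA9_COST.items()
    }

def theorem1_cost(n):
    """Costs of Theorem 1 evaluated at n."""
    return {gate: a * n + b for gate, (a, b) in THEOREM1_COST.items()}

for n in (6, 7, 10):
    print(n, mcsu2_cost(n), mcsu2_cost(n) == theorem1_cost(n))
```

The eight arbitrary $\{R_{\hat{x}}, R_{\hat{z}}\}$ gates of Lemma 5 are counted separately and do not enter these totals.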

We note that if  $2 \le n \le 5$ , the costs and depths can be easily found using Lemma 5, such that Lemma 8 is used for a subset of controls of size 3, Circuit (7) is used for size 2, and a CZ (as H-CNOT-H) gate is used for size 1. The number of single-qubit gates can be reduced in some of these cases.

Moreover, we show in Appendix B that a trade-off between the depth of CNOT and T gates can be achieved. With no change to any cost, it is possible to reduce the depth of T gates to 5n + O(1), while increasing the depth of CNOT gates to 11n + O(1). The T depth can be further reduced to 4n; however, this results in a CNOT cost and depth of 12.5n + O(1) and 12n + O(1) respectively.

In Appendix C we show that in case a set  $\chi$  of  $n_{\chi} \leq \lfloor \frac{n-6}{2} \rfloor$  dirty ancilla qubits is available, each ancilla can be used to reduce the cost of Clifford+T gates, as presented in Figure 1. If  $n_{\chi} = \lfloor \frac{n-6}{2} \rfloor$ , the cost of both the T and CNOT gates is reduced to 8n + O(1).

#### 2.1.3 Multi-controlled multi-target SU(2)

We present a method to extend the multi-controlled SU(2) structure to implement a multi-control multi-target SU(2) gate (MCMTSU2), in which any SU(2) operator can be applied to each target. We use the following structure.



**Lemma 10.** Any  $\prod_{j=1}^{m} [R_{\hat{v}_{j}}(\lambda_{j})]_{t_{j}}^{C}$  gate with  $R_{\hat{v}_{j}}(\lambda_{j}) \in SU(2)$  can be implemented using 8m gates from  $\{R_{\hat{x}}, R_{\hat{z}}\}$ , two  $\prod_{j=1}^{m} [Z]_{t_{j}}^{C_{1}}$  and two  $\prod_{j=1}^{m} [Z]_{t_{j}}^{C_{2}}$  gates as Circuit (8) such that  $A_{2}^{j} = R_{\hat{x}}(-\frac{\lambda_{j}}{4}), A_{3}^{j} = R_{\hat{x}}(\frac{\lambda_{j}}{4}), A_{4}^{j} = R_{\hat{z}}(\theta_{2}^{j})R_{\hat{x}}(\theta_{1}^{j}), A_{1}^{j} = A_{4}^{j\dagger}A_{3}^{j}, \text{ and } C_{1} \cup C_{2} = C.$  The  $[\Delta]_{C}$  gates apply a relative phase.

*Proof.* This can be realised by applying the structure described in Lemma 5 for each  $[R_{\hat{v}_j}(\lambda_j)]_{t_j}^C$  gate separately. The gates can be reordered to achieve Circuit (8) since MCZ gates commute with each other, and single-qubit gates operating on different qubits commute as well.

The following identity can be derived from Lemma 6.

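[Circuit (9): a two-target MCZ gate with a relative phase gate $[\Delta]$, with controls in $C_1$ ($n_1$ qubits) and $C_2$ ($n_2$ qubits), rewritten as a single-target MCZ gate conjugated by CNOT gates between the two targets.]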

Lemma 11 describes the structure used to extend any MCZ gate to an MCMTZ gate with m targets, using 2(m-1) CNOT gates in depth  $2\lceil \log_2(m)\rceil$ .

**Lemma 11.**  $[\Delta']_C \prod_{j=1}^m [Z]_{t_j}^{C'} = (\prod_{k=1}^{\lceil \log_2(m) \rceil} \prod_{l=1}^{f'} [X]_{t_l}^{t_{l+f}}) [\Delta']_C [Z]_{t_1}^{C'} (\prod_{k=1}^{\lceil \log_2(m) \rceil} \prod_{l=1}^{f'} [X]_{t_l}^{t_{l+f}})^{\dagger}$  with  $C' \subseteq C$ , and  $t_j \notin C$ , such that  $f = 2^{k-1}$  and  $f' = \min(f, m-f)$ .

*Proof.* As demonstrated in Circuit (9), it follows from Lemma 6 that

$$[\Delta']_C[Z]_{t_j}^{C'}[Z]_{t_{j+f}}^{C'} = [X]_{t_j}^{t_{j+f}}[\Delta']_C[Z]_{t_j}^{C'}[X]_{t_j}^{t_{j+f}}$$

The number of targets can therefore be doubled by applying a CNOT transformation in depth 2, as follows:

$$[\Delta']_C \prod_{j=1}^{2f} [Z]_{t_j}^{C'} = (\prod_{l=1}^f [X]_{t_l}^{t_{l+f}}) [\Delta']_C \prod_{j=1}^f [Z]_{t_j}^{C'} (\prod_{l=1}^f [X]_{t_l}^{t_{l+f}})$$

This can be repeatedly applied until  $\log_2(f) = \lfloor \log_2(m) \rfloor$ . The remaining m - f targets can be added similarly by applying  $\prod_{l=1}^{m-f} [X]_{t_l}^{t_{l+f}}$  on each side.
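The layered CNOT schedule of Lemma 11 can be sketched as follows. The code generates the layers $k = 1, ..., \lceil \log_2(m) \rceil$ with $f = 2^{k-1}$ and $f' = \min(f, m-f)$, and then checks on a small example (one control, $m = 5$ targets, no $[\Delta']$ gate) that conjugating a single-target controlled Z gate layer by layer yields the multi-target gate. The function names, the single-control setting and the qubit indices are assumptions made for this illustration.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def controlled(n, controls, target, U):
    """Dense matrix on n qubits applying U on `target` iff all `controls` are |1>."""
    dim = 2 ** n
    M = np.zeros((dim, dim), dtype=complex)
    for col in range(dim):
        bits = [(col >> (n - 1 - q)) & 1 for q in range(n)]
        if all(bits[c] for c in controls):
            for out in (0, 1):
                new_bits = list(bits)
                new_bits[target] = out
                row = sum(b << (n - 1 - q) for q, b in enumerate(new_bits))
                M[row, col] += U[out, bits[target]]
        else:
            M[col, col] = 1
    return M

def fanout_layers(m):
    """CNOT layers as lists of (control, target) pairs over target labels 1..m."""
    num_layers = (m - 1).bit_length()      # equals ceil(log2(m)) for m >= 1
    layers = []
    for k in range(1, num_layers + 1):
        f = 2 ** (k - 1)
        f_prime = min(f, m - f)
        layers.append([(l + f, l) for l in range(1, f_prime + 1)])
    return layers

m = 5
layers = fanout_layers(m)
n_qubits = m + 1                            # qubit 0 is the control, qubits 1..m are t_1..t_m

gate = controlled(n_qubits, [0], 1, Z)      # single-target [Z]_{t_1}^{C'}
for layer in layers:                        # innermost layer (k = 1) first
    L = np.eye(2 ** n_qubits, dtype=complex)
    for ctrl, tgt in layer:
        L = controlled(n_qubits, [ctrl], tgt, X) @ L
    gate = L @ gate @ L.conj().T

expected = np.eye(2 ** n_qubits, dtype=complex)
for j in range(1, m + 1):
    expected = controlled(n_qubits, [0], j, Z) @ expected

fanout_holds = np.allclose(gate, expected)
cnot_count = 2 * sum(len(layer) for layer in layers)   # both sides of the MCZ gate
cnot_depth = 2 * len(layers)
print(layers, fanout_holds, cnot_count, cnot_depth)
```

For $m = 8$ the same function returns three layers with 1, 2 and 4 CNOT gates.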

Circuit (10) describes the structure for m=8.



The following can be obtained using Lemma 10 and Lemma 11.

**Theorem 2.** Any multi-controlled multi-target SU(2) gate with  $n \geq 6$  controls and  $m \geq 1$  targets can be implemented without ancilla with the same costs and depths as an MCSU2 gate implemented using Lemma 5, with an increase of 8(m-1) in both the  $\{R_{\hat{x}}, R_{\hat{z}}\}$  count and the CNOT count, and an increase of  $8\lceil \log_2(m) \rceil$  in the CNOT depth.

The final costs and depths of our implementation can therefore be achieved by applying these adjustments to the costs and depths listed in Theorem 1.

### 2.2 Multi-controlled X

In this section, we use the MCSU2 structure in order to efficiently implement the Multi-controlled X gate (MCX, also known as MCT) with a single or multiple targets, and with a single dirty ancilla qubit.

#### 2.2.1 MCX

In order to implement a  $[X]_t^C$  gate with one dirty ancilla qubit a, we use the implementation of  $[R_{\hat{v}}(\lambda)]_{t'}^{C'}$  with  $\lambda = 2\pi$ ,  $C' = \{C, t\}$ , and t' = a. As previously shown,  $[R_{\hat{v}}(2\pi)]_a^{\{C, t\}} = [-I]_a^{\{C, t\}} = [Z]_t^C$  for any choice of  $\hat{v}$ , and simply using the well known special case of Lemma 2, we get  $[X]_t^C = [H]_t[R_{\hat{v}}(2\pi)]_a^{\{C, t\}}[H]_t$ . Setting  $\hat{v} = \hat{y}$  and implementing  $[R_{\hat{v}}(2\pi)]_a^{\{C, t\}}$  according to Lemma 4, with  $\hat{v}_1 = \hat{z}$ , and  $\hat{v}_2 = \hat{x}$  is a convenient choice. The relative phase  $[\Delta]$  gates are added similarly to Lemma 5. This provides the following circuit.

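[Circuit diagram: implementation of $[X]_t^C$ using one dirty ancilla qubit $a$, with the control set $\{C, t\}$ split into the subsets $C_1$ ($n_1$ qubits) and $C_2$ ($n_2$ qubits), and with relative phase gates $[\Delta]$ added.]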

The cost and depth of this implementation are similar to those of the multi-controlled SU(2) gate with n+1 controls. Arbitrary  $\{R_{\hat{x}}, R_{\hat{z}}\}$  gates are removed, and six H gates are added instead. Notice that the control set C' can be ordered such that two H gates are cancelled out in the full Clifford+T decomposition. Theorem 3 simply follows.
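The macro structure of this construction can be checked on a small example. The sketch below is an illustration with two controls: it applies Lemma 4 with $\hat{v}_1 = \hat{z}$, $\hat{v}_2 = \hat{x}$ on the ancilla, conjugates by H on the target, and compares with a directly constructed Toffoli gate. Because the comparison is made on the full matrix, it holds for every state of the ancilla, which is what "dirty" requires. The relative phase gates are omitted, and the helper names, the split of the controls and the qubit indices are assumptions of this example.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
H = (X + Z) / np.sqrt(2)

def controlled(n, controls, target, U):
    """Dense matrix on n qubits applying U on `target` iff all `controls` are |1>."""
    dim = 2 ** n
    M = np.zeros((dim, dim), dtype=complex)
    for col in range(dim):
        bits = [(col >> (n - 1 - q)) & 1 for q in range(n)]
        if all(bits[c] for c in controls):
            for out in (0, 1):
                new_bits = list(bits)
                new_bits[target] = out
                row = sum(b << (n - 1 - q) for q, b in enumerate(new_bits))
                M[row, col] += U[out, bits[target]]
        else:
            M[col, col] = 1
    return M

# Qubit indices (illustrative): controls c1, c2, target t, dirty ancilla a.
n = 4
c1, c2, t, a = 0, 1, 2, 3
C1 = [c1, c2]       # first subset of C' = {C, t}
C2 = [t]            # second subset of C' = {C, t}

cz = controlled(n, C1, a, Z)    # [Pi(z)]_a^{C1}
cx = controlled(n, C2, a, X)    # [Pi(x)]_a^{C2}
mcz = cx @ cz @ cx @ cz         # equals [-I]_a^{C, t} = [Z]_t^C

h_on_t = np.kron(np.kron(np.kron(I2, I2), H), I2)
mcx = h_on_t @ mcz @ h_on_t

expected = controlled(n, [c1, c2], t, X)    # acts as the identity on the ancilla
mcx_holds = np.allclose(mcx, expected)
print(mcx_holds)
```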

**Theorem 3.** Any multi-controlled X gate with  $n \geq 5$  controls and a single target can be implemented with one dirty ancilla qubit with the same costs and depths as in Theorem 1, with n replaced by n+1 and four additional H gates. No arbitrary  $\{R_{\hat{x}}, R_{\hat{z}}\}$  gates are needed.

#### 2.2.2 MCMTX

We extend the multi-controlled X structure to a multi-control multi-target X gate (MCMTX). First we demonstrate that if the number of target qubits is larger than one, an ancilla qubit is not required. The following can be easily deduced from Lemma 6 (and has been shown in [18]).

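[Circuit diagram: an MCX gate with two targets $t_1, t_2$ and the controls split into $C_1$ ($n_1$ qubits) and $C_2$ ($n_2$ qubits), implemented without an additional ancilla qubit.]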

Since  $t_2$  is not used while the MCX gate is applied, it can be utilized as the required dirty ancilla previously referred to as qubit a. Similarly, Lemma 11 can be used in order to add any number of targets as follows.

The last step is achieved by commuting and cancelling out the H gates, which results in transforming controls into targets and vice versa, as has been mentioned in [19, Proposition 1]. Theorem 4 simply follows.

**Theorem 4.** Any multi-controlled multi-target X gate with  $n \geq 5$  controls and  $m \geq 2$  targets can be implemented without ancilla with the same costs and depths as an MCX gate, with an increase of 2(m-1) in the CNOT count and of  $2\lceil \log_2(m) \rceil$  in the CNOT depth.

The costs and depths of our MCMTX structure can be achieved by merging Theorem 4 and Theorem 3. As can be seen in Circuit (13), m-1 targets can be used as dirty ancilla qubits for the implementation of the MCX gate, while one acts as the required ancilla qubit a, and the rest m-2 can be used in order to reduce the cost of the MCX gate. We discuss these cost reductions in Appendix C.

The results of this section can be used to implement any multi-controlled multi-target  $\Pi$  gate as well, simply by using Lemma 2 in order to change the axes of rotations. This requires no more than four arbitrary  $\{R_{\hat{x}}, R_{\hat{z}}\}$  gates per target.

### 2.3 Multi-controlled U(2)

In this section we use the MCMTSU2 structures in order to efficiently implement the multi-controlled U(2) gate (MCU2) using one clean ancilla, with a single or multiple targets.

#### 2.3.1 MCMTU2

In order to implement any  $[U]_t^C$  gate, with  $U \in U(2)$ , we recall that U can be expressed as  $U = e^{i\psi} R_{\hat{v}}(\lambda)$ , and therefore  $[U]_t^C = [R_{\hat{v}}(\lambda)]_t^C [P(\psi)]^C$ . We now show that  $[U]_t^C = [R_{\hat{v}}(\lambda)]_t^C [R_{\hat{z}}(-2\psi)]_{a_{|0\rangle}}^C$  with  $a_{|0\rangle}$  a clean ancilla qubit in state  $|0\rangle$ .

**Lemma 12.**  $[P(\psi)]^C = [R_{\hat{z}}(-2\psi)]^C_{a_{|0\rangle}}$  for any angle  $\psi$  with  $C = \{c_1, c_2, ..., c_n\}$ , where  $a_{|0\rangle}$  is a clean ancilla qubit in state  $|0\rangle$ .

*Proof.* The gate  $[R_{\hat{z}}(-2\psi)]_{a_{|0\rangle}}^C$  applies the following operator on the ancilla qubit iff set C is in state  $|11..1\rangle$ :

$$R_{\hat{z}}(-2\psi) = \begin{pmatrix} e^{i\psi} & 0\\ 0 & e^{-i\psi} \end{pmatrix} = e^{i\psi} \begin{pmatrix} 1 & 0\\ 0 & e^{i(-2\psi)} \end{pmatrix}$$

Therefore, it can be written as  $[R_{\hat{z}}(-2\psi)]_{a_{|0\rangle}}^C = [P(-2\psi)]_{a_{|0\rangle}}^C [P(\psi)]^C$ . Since  $a_{|0\rangle}$  is known to be in the state  $|0\rangle$ , the controlled gate  $[P(-2\psi)]_{a_{|0\rangle}}^C$  has no effect and can be removed.
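Lemma 12 can be illustrated numerically by restricting the controlled rotation to the subspace in which the ancilla is $|0\rangle$. The sketch below uses two controls and assumes $R_{\hat{z}}(\lambda) = \mathrm{diag}(e^{-i\lambda/2}, e^{i\lambda/2})$, consistent with the matrix in the proof; the helper `controlled`, the qubit ordering (the ancilla is the last, least significant qubit) and the value of $\psi$ are illustrative.

```python
import numpy as np

def controlled(n, controls, target, U):
    """Dense matrix on n qubits applying U on `target` iff all `controls` are |1>."""
    dim = 2 ** n
    M = np.zeros((dim, dim), dtype=complex)
    for col in range(dim):
        bits = [(col >> (n - 1 - q)) & 1 for q in range(n)]
        if all(bits[c] for c in controls):
            for out in (0, 1):
                new_bits = list(bits)
                new_bits[target] = out
                row = sum(b << (n - 1 - q) for q, b in enumerate(new_bits))
                M[row, col] += U[out, bits[target]]
        else:
            M[col, col] = 1
    return M

def rz(lam):
    """R_z(lam) = diag(exp(-i lam/2), exp(i lam/2)) (assumed convention)."""
    return np.diag([np.exp(-1j * lam / 2), np.exp(1j * lam / 2)])

psi = 0.37                                  # illustrative angle
gate = controlled(3, [0, 1], 2, rz(-2 * psi))

# Rows/columns with an even index have the ancilla (last qubit) in |0>.
restricted = gate[::2, ::2]                 # action on the controls when a = |0>
leakage = gate[1::2, ::2]                   # amplitude moved from a = |0> to a = |1>

# [P(psi)]^C on two qubits: phase exp(i psi) only on |11>.
phase_gate = np.diag([1, 1, 1, np.exp(1j * psi)])

lemma12_holds = np.allclose(restricted, phase_gate) and np.allclose(leakage, 0)
print(lemma12_holds)
```

If the ancilla were in $|1\rangle$ instead, the controls would acquire the phase $e^{-i\psi}$, which is why a clean ancilla is required here.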

The gate  $[R_{\hat{z}}(2\psi)]_{a_{|0\rangle}}^C$  can be implemented using Lemma 5, with  $A_4$  transforming the  $\hat{z}$  axis to  $\hat{x}$  and thus can be replaced with H. The gate  $[R_{\hat{z}}(-2\psi)]_{a_{|0\rangle}}^C$  can therefore be implemented as the inverse of this structure as:

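[Circuit diagram: implementation of $[R_{\hat{z}}(-2\psi)]_{a_{|0\rangle}}^C$ with the controls split into $C_1$ ($n_1$ qubits) and $C_2$ ($n_2$ qubits), using H gates and the single-qubit gates $B_1$, $B_2$ on the ancilla.]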

with  $B_1 = R_{\hat{x}}(\frac{\psi}{2})$ ,  $B_2 = R_{\hat{x}}(-\frac{\psi}{2})$ . The first  $B_2$  gate can commute with the first H gate, and transform to an  $R_{\hat{z}}$  gate operating on the state  $|0\rangle$ . This gate therefore only applies a global phase and can be removed. Since  $[R_{\hat{v}}(\lambda)]_t^C [R_{\hat{z}}(-2\psi)]_{a_{|0\rangle}}^C$  is an MCMTSU2 gate with m=2 targets, it can be implemented up to a global phase using Lemma 10 and Lemma 11 as follows.



Moreover, any MCMTU2 gate  $\prod_{j=1}^m [U_j]_{t_j}^C$  with  $m \geq 1$  targets, such that  $U_j = e^{i\psi_j} R_{\hat{v}_j}(\lambda_j) \in U(2)$ , can be implemented as the MCMTSU2 gate  $[R_{\hat{z}}(-2\psi)]_{a_{|0\rangle}}^C \prod_{j=1}^m [R_{\hat{v}_j}(\lambda_j)]_{t_j}^C$  such that  $R_{\hat{z}}(-2\psi), R_{\hat{v}_j}(\lambda_j) \in SU(2)$  and  $\psi = \sum_{j=1}^m \psi_j$ . Theorem 5 follows.

**Theorem 5.** Any multi-controlled multi-target U(2) gate with  $n \geq 6$  controls and  $m \geq 1$  targets can be implemented with one ancilla qubit in state  $|0\rangle$  with the same costs and depths as an MCMTSU2 gate implemented using Theorem 2 with m+1 targets, with five  $\{R_{\hat{z}}, R_{\hat{x}}\}$  gates replaced by two H gates.

Our results for the discussed multi-control single-target gates are summarized in Table 1. We compare our MCSU2 and MCX implementations to those suggested in [11] and [5] respectively. For MCU2, we compare to the structure described in [12]. The structure requires two MCX gates for which we assume the use of the implementation from [5]. We list the leading terms of the cost and depth of the CNOT and T gates. As can be seen, we provide improvements in each category. Our cost reductions of the CNOT and T gates are 25% - 62.5%, and up to 50% respectively, and the depth reductions are between 50% - 75%.

| Gate type | Ancilla | Source | CNOT cost | CNOT depth | T cost | T depth |
|-------|-----------|-----------|------|-------|------|-------|
| MCSU2 | None      | [11]      | 20n  | 20n   | 20n  | 20n   |
|       |           | Ours      | 12n  | 8n    | 16n  | 8n    |
| MCX   | One dirty | [5]       | 16n  | 16n   | 16n  | 16n   |
|       |           | Ours      | 12n  | 8n    | 16n  | 8n    |
| MCU2  | One clean | [12], [5] | 32n  | 32n   | 32n  | 32n   |
|       |           | Ours      | 12n  | 8n    | 16n  | 8n    |

Table 1: A summary of our ATA multi-controlled single-target results compared to previous methods.
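The quoted percentage ranges can be reproduced directly from the leading-order coefficients in Table 1. The following is a minimal sketch; the dictionary layout and the helper name `reductions` are illustrative choices and not part of the original work.

```python
# Leading-order coefficients of n, copied from Table 1 as
# (CNOT cost, CNOT depth, T cost, T depth).
TABLE_1 = {
    "MCSU2": {"previous": (20, 20, 20, 20), "ours": (12, 8, 16, 8)},
    "MCX": {"previous": (16, 16, 16, 16), "ours": (12, 8, 16, 8)},
    "MCU2": {"previous": (32, 32, 32, 32), "ours": (12, 8, 16, 8)},
}
COLUMNS = ("CNOT cost", "CNOT depth", "T cost", "T depth")


def reductions(previous, ours):
    """Relative reduction, in percent, for each column."""
    return tuple(100 * (p - o) / p for p, o in zip(previous, ours))


for gate, row in TABLE_1.items():
    percentages = reductions(row["previous"], row["ours"])
    summary = ", ".join(f"{name}: {value:g}%" for name, value in zip(COLUMNS, percentages))
    print(f"{gate}: {summary}")
```

The CNOT cost reductions span the three rows (40%, 25% and 62.5%), and the largest T cost reduction is the 50% of the MCU2 row.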

Table 3 presents our results for an alternative method which favours the depth of T gates, as discussed in Appendix B.

We discuss cost reductions which can be applied, in case additional dirty ancilla qubits are available, in Appendix C. As noted, these reductions can also be applied for MCMTX gates with m > 2 targets without ancilla. The results for our multi-controlled single-target gate implementations with additional ancilla are presented in Figure 1.

## 3 Linear-nearest-neighbor (LNN) connectivity

### 3.1 Multi-controlled SU(2)

#### 3.1.1 MCSU2 macro structure

In this section we provide an implementation of any LNN-restricted multi-controlled SU(2) gate with n control qubits  $C = \{c_1, c_2, ..., c_n\}$  and a single target qubit t. The gate is implemented in a circuit over  $k' \geq n + 1$  qubits  $Q' = \{q'_1, q'_2, ..., q'_{k'}\}$  such that the qubits  $q'_l$  and  $q'_{l+1}$  are nearest neighbors for  $1 \leq l < k'$  (LNN qubits).

We define the set  $Q = \{q_1, q_2, ..., q_k\}$  of size k, such that  $Q \in Q'$ , as the *smallest* set of LNN qubits which satisfies  $\{C, t\} \in Q$ . Each control qubit in the set C corresponds to an arbitrary qubit in the set  $Q \setminus t$ , and it is assumed that the target qubit is located at the bottom, i.e.  $t = q_k$ , which implies  $q_1 \in C$ . We will discuss the additional cost (communication overhead) required in case  $t \neq q_k$  in Section 3.1.2.

The macro structure described in Lemma 5 can be used directly. In order to allow for an efficient LNN implementation, we require an additional constraint on the subsets  $C_1$  and  $C_2$ , such that in each subset only the first two qubits may be nearest neighbors. For any given sets C, Q, the subsets can be defined such that this constraint is met, and  $C_1 \cup C_2 = C$ . We use the following procedure to define the sets.

We start with $C_1 = \{q_1\}$, $C_2 = \emptyset$, and the iteration index $l = 2$. At each iteration, $q_l$ is added to $C_1$ if

$$(q_l \in C) \land (l = 2 \lor (q_{l-1} \not\in C_1 \land q_{l-1} \neq C_2[1])).$$

If this condition is not met and  $q_l \in C$ , the qubit  $q_l$  is added to  $C_2$ . Finally, l increases by 1 for the next iteration, stopping when  $C_1 \cup C_2$  holds all of the control qubits in C, i.e.  $n_1 + n_2 = n$ .

We define two subsets  $Q_1, Q_2 \in Q$  of sizes  $k_1$  and  $k_2$  as the *smallest* sets of LNN qubits which satisfy  $\{C_1, t\} \in Q_1$  and  $\{C_2, t\} \in Q_2$ . The relative phase gates  $[\Delta_1]$  and  $[\Delta_2]$  from Lemma 5 operate on  $Q_1 \setminus t$  and  $Q_2 \setminus t$ .

For a specific example of a given set  $Q = \{q_1, q_2, ..., q_{19}\}$ , such that  $t = q_{19}$ , we use a binary string to describe the control sets  $C, C_1, C_2$  such that the string holds "1" for index l iff  $q_l$  is in the control set. The following provides an arbitrarily chosen set C, and the corresponding subsets  $C_1, C_2$  achieved using the procedure above.

$$C \ = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0]$$

$$C_1 = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0]$$

$$C_2 = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0]$$

As can be seen, for each set  $C_1, C_2$  it holds that there are no neighboring controls, excluding the first two. In this example,  $Q_1 = Q$ , and  $Q_2 = Q[9:19]$  are of size  $k_1 = 19$ , and  $k_2 = 11$ .
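The procedure above is short enough to run directly. The sketch below is one possible Python rendering of it, applied to the 19-qubit example; the function names `split_controls` and `as_bits` are hypothetical, and indices are 1-based to match the text.

```python
def split_controls(is_control):
    """Split LNN control qubits into the subsets C_1 and C_2.

    is_control[l - 1] is True iff q_l is a control; q_1 is assumed to be one.
    Returns two lists of 1-based qubit indices.
    """
    if not is_control[0]:
        raise ValueError("q_1 is assumed to be a control qubit")
    c1, c2 = [1], []
    for l in range(2, len(is_control) + 1):
        if not is_control[l - 1]:
            continue
        prev = l - 1
        prev_is_first_of_c2 = bool(c2) and c2[0] == prev
        # Condition from the text: l = 2, or q_{l-1} is neither in C_1
        # nor the first element of C_2.
        if l == 2 or (prev not in c1 and not prev_is_first_of_c2):
            c1.append(l)
        else:
            c2.append(l)
    return c1, c2


def as_bits(indices, k):
    """Binary-string description of a control set over k qubits."""
    members = set(indices)
    return [1 if l in members else 0 for l in range(1, k + 1)]


C = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0]
c1, c2 = split_controls([bool(bit) for bit in C])
print("C_1 =", as_bits(c1, len(C)))
print("C_2 =", as_bits(c2, len(C)))
```

Running the loop over all of $Q$ rather than stopping at $n_1 + n_2 = n$ gives the same subsets, since the remaining qubits are not controls.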

The following circuit describes the structure achieved for this example.



We note that in case $C_2 = \emptyset$, each structure controlled by $C_2$ is implemented as a single Z gate which can be cancelled out. This reduces the number of $\{R_{\hat{x}}, R_{\hat{z}}\}$ gates to five, similarly to the "ABC" structure described in [12].

#### 3.1.2 MCSU2 decomposition

In this section we provide a decomposition of  $[\Delta_1]_{\{Q_1\setminus t\}}[Z]_t^{C_1}$ ,  $[\Delta_2]_{\{Q_2\setminus t\}}[Z]_t^{C_2}$  and their inverse gates in terms of Clifford+T gates, as required for the structure described in Section 3.1.1. As mentioned, we can assume that only the first two control bits in each of the subsets  $C_1$  and  $C_2$  may be neighboring, and the target qubit t is located at the bottom.

First, we define the following notation.

$$\{\overline{Z}\}_{C,Q}^{l} = \prod_{j=l_0}^{l} [\overline{Z}]_{q_j}^{C^j} , [\overline{Z}]_{q_j}^{C^j} = \begin{cases} I & q_j \in C \\ [Z]_{q_j}^{C^j} & q_j \notin C \end{cases}$$

with  $Q = \{q_1, q_2, ..., q_k\}$ ,  $C = \{c_1, c_2, ..., c_n\} \in Q \setminus q_k$  and  $l_0 \le l \le k$ . The subset  $C^j := Q[1:j-1] \cap C$  with  $2 \le j \le k$  holds all control qubits above  $q_j$ . In addition,  $l_0$  is defined as the index of the first non-control qubit in Q, i.e. the smallest index which satisfies  $q_{l_0} \notin C \land C^{l_0} \ne \emptyset$  and  $2 \le l_0 \le k$ .

**Lemma 13.** If  $l_0 < k$ , and  $q_l \notin C$  with  $l_0 < l \le k$ , then  $\{\overline{Z}\}_{C,Q}^l = [X]_{q_{f_l}}^{Q[f_l+1:l]} \{\overline{Z}\}_{C,Q}^{l-1} [X]_{q_{f_l}}^{Q[f_l+1:l]}$ , with  $f_l$  as the smallest index which satisfies  $\{\overline{Z}\}_{C,Q}^{l-1} = \{\overline{Z}\}_{C,Q}^{f_l}$  and  $l_0 \le f_l < l$ .

*Proof.* The definition of  $f_l$  implies  $q_{f_l} \notin C$ , and  $Q[f_l + 1 : l] \setminus q_l \in C$ . In other words,  $q_{f_l}$  is the nearest non-control qubit above  $q_l$ . Therefore,

$$\{\overline{Z}\}_{C,Q}^{l-1} = \{\overline{Z}\}_{C,Q}^{f_l} = \{\overline{Z}\}_{C,Q}^{f_l-1}[Z]_{q_{f_l}}^{C^{f_l}}$$

Since the qubits $Q[f_l:l]$ are not used in $\{\overline{Z}\}_{C,Q}^{f_l-1}$, this gate commutes with $[X]_{q_{f_l}}^{Q[f_l+1:l]}$. Therefore:

$$[X]_{q_{f_{l}}}^{Q[f_{l}+1:l]}\{\overline{Z}\}_{C,Q}^{l-1}[X]_{q_{f_{l}}}^{Q[f_{l}+1:l]}=\{\overline{Z}\}_{C,Q}^{f_{l}-1}[X]_{q_{f_{l}}}^{Q[f_{l}+1:l]}[Z]_{q_{f_{l}}}^{C^{f_{l}}}[X]_{q_{f_{l}}}^{Q[f_{l}+1:l]}$$

According to Lemma 6:

$$[X]_{q_{f_l}}^{Q[f_l+1:l]}[Z]_{q_{f_l}}^{C^{f_l}}[X]_{q_{f_l}}^{Q[f_l+1:l]} = [Z]_{q_{f_l}}^{C^{f_l}}[Z]_{q_l}^{C^{l}}$$

where we used  $[Z]_{q_l}^{C^l} = [Z]^{\{C^{f_l},Q[f_l+1:l]\}}$ , since  $Q[f_l+1:l] \setminus q_l \in C$  and  $q_{f_l} \notin C$  due to the definition of  $f_l$ . Finally, we can substitute and use

$$\{\overline{Z}\}_{C,Q}^{f_l-1}[Z]_{q_{f_l}}^{C^{f_l}}[Z]_{q_l}^{C^l} = \{\overline{Z}\}_{C,Q}^{l-1}[Z]_{q_l}^{C^l} = \{\overline{Z}\}_{C,Q}^{l}$$

Since there are no two neighboring control qubits for  $l_0 < l \le k$ , the only possible values of  $f_l$  are

$$f_{l} = \begin{cases} l - 1 & q_{l}, q_{l-1} \notin C \\ l - 2 & q_{l}, q_{l-2} \notin C \land q_{l-1} \in C \end{cases}$$

by the definition of $f_l$. Therefore, the $[X]_{q_{f_l}}^{Q[f_l+1:l]}$ gates in Lemma 13 are restricted to $[X]_{q_{l-1}}^{q_l}$ or $[X]_{q_{l-2}}^{q_{l-1},q_l}$, i.e. either LNN CNOT or LNN Toffoli gates which can be implemented up to a relative phase.

Using the following notation

$$[\overline{X}]_{q_{l}} = \begin{cases} I & q_{l} \in C \\ [X]_{q_{l-1}}^{q_{l}} & q_{l}, q_{l-1} \notin C \\ [X_{\Delta}]_{q_{l-2}}^{\{q_{l}, q_{l-1}\}} & q_{l}, q_{l-2} \notin C \land q_{l-1} \in C \end{cases}$$

and repeatedly applying the recursive formula presented in Lemma 13, we achieve the following lemma:

**Lemma 14.** If  $l_0 < k$ ,  $q_k \notin C$  and  $q_l \notin C \lor q_{l+1} \notin C$  for any  $l_0 < l < k$ , then  $\{\overline{Z}\}_{C,Q}^k = \left(\prod_{l=l_0+1}^k [\overline{X}]_{q_l}\right)^{\dagger} [Z]_{q_{l_0}}^{C^{l_0}} \left(\prod_{l=l_0+1}^k [\overline{X}]_{q_l}\right)$  for  $n = |C| \ge 1$  and k = |Q| > n.
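Because every gate in Lemma 14 either permutes or phases computational basis states, the identity can be checked by brute force on small instances. The sketch below does this under two assumptions that are ours: exact Toffoli gates are used in place of the relative-phase $[X_{\Delta}]$ gates, and the gate with the largest index $l$ is taken to act first, as implied by unrolling Lemma 13. All function names are hypothetical.

```python
from itertools import product


def first_non_control(controls, k):
    """Index l_0: the first non-control qubit that has a control above it."""
    return next(l for l in range(2, k + 1)
                if l not in controls and min(controls) < l)


def xbar_gates(controls, k):
    """(target, controls) of every non-identity X-bar gate for l = l_0+1..k."""
    l0 = first_non_control(controls, k)
    gates = []
    for l in range(l0 + 1, k + 1):
        if l in controls:
            continue                           # identity
        if l - 1 not in controls:
            gates.append((l - 1, (l,)))        # CNOT on q_{l-1}, controlled by q_l
        else:
            assert l - 2 not in controls, "neighboring controls below q_{l_0}"
            gates.append((l - 2, (l - 1, l)))  # Toffoli on q_{l-2}
    return gates


def conjugated_parity(bits, controls, k):
    """Phase exponent of (prod X-bar)^dagger [Z] (prod X-bar) on a basis state."""
    b = dict(enumerate(bits, start=1))
    for target, ctrls in reversed(xbar_gates(controls, k)):
        if all(b[c] for c in ctrls):
            b[target] ^= 1
    l0 = first_non_control(controls, k)
    return int(b[l0] and all(b[c] for c in controls if c < l0))


def direct_parity(bits, controls, k):
    """Phase exponent of the product of MCZ gates defining the Z-bar gate."""
    b = dict(enumerate(bits, start=1))
    parity = 0
    for j in range(first_non_control(controls, k), k + 1):
        if j not in controls and b[j] and all(b[c] for c in controls if c < j):
            parity ^= 1
    return parity


def lemma14_holds(controls, k):
    return all(conjugated_parity(bits, controls, k) == direct_parity(bits, controls, k)
               for bits in product((0, 1), repeat=k))


print(lemma14_holds({1, 3}, 5), lemma14_holds({1, 2, 4, 7}, 8))
```

The second instance has two neighboring controls at the top, so the central gate is a CCZ rather than a CZ.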

Since the first two controls might be neighboring, we get $|C^{l_0}| \in \{1,2\}$. Therefore, the gate $[Z]_{q_{l_0}}^{C^{l_0}}$ is restricted to $[Z]_{q_{l_0}}^{q_{l_0-1}}$ or $[Z]_{q_{l_0}}^{\{q_{l_0-2},q_{l_0-1}\}}$, i.e. either LNN CZ or LNN CCZ. The total number of $[\overline{X}]_{q_l}$ gates used to implement $\{\overline{Z}\}_{C,Q}^k$ according to Lemma 14 is $2(k-l_0)$. The number of LNN $[X_{\Delta}]$ gates equals $2|Q[l_0:k]\cap C|=2(n-|C^{l_0}|)$, i.e. twice the number of control qubits located below $q_{l_0}$. The I gates are not required to be implemented, and their number is the same as the number of $[X_{\Delta}]$ gates. Therefore, the number of LNN CNOT gates is $2(k-l_0)-4(n-|C^{l_0}|)$. Since $Q[1] \in C$, we get $l_0 = |C^{l_0}| + 1$ with $l_0 \in \{2,3\}$, and we can define $l'_0 = l_0 - 2$ such that $l'_0 = 0$ indicates the use of a CZ gate, and $l'_0 = 1$ indicates the use of a CCZ gate. The number of Toffoli gates can be written as $2(n-l'_0-1)$, and the number of CNOT gates as $2k-4n+2l'_0$. The following circuits describe the implementations of

$$[\Delta_1]_{\{Q_1\backslash t\}}[Z]_t^{C_1} = \{\overline{Z}\}_{C_1,Q_1}^{k_1-1}[Z]_t^{C_1} = \{\overline{Z}\}_{C_1,Q_1}^{k_1}$$

and

$$[\Delta_2]_{\{Q_2 \setminus t\}}[Z]_t^{C_2} = [S]^{C_2[1:2]}\{\overline{Z}\}_{C_2,Q_2}^{k_2 - 1}[Z]_t^{C_2} = [S]^{C_2[1:2]}\{\overline{Z}\}_{C_2,Q_2}^{k_2}$$

according to Lemma 14, for  $C_1, C_2, Q_1, Q_2$  as the example defined in Section 3.1.1.



Here, we add the controlled S (CS) gate in order to transform the CCZ gate into an [iZ] gate, similarly to the ATA case in Section 2.1.2.
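The closed-form gate counts $2k-4n+2l'_0$ (CNOT) and $2(n-l'_0-1)$ (Toffoli) can be compared against a direct classification of the $[\overline{X}]_{q_l}$ gates for the example subsets of Section 3.1.1. This is a minimal sketch with hypothetical function names; $C_2$ is relabelled relative to $Q_2 = Q[9:19]$, so that $q_9$ becomes index 1.

```python
def count_xbar_gates(controls, k):
    """Count gates in both X-bar products of Lemma 14 (q_1 assumed a control).

    Returns (number of CNOT gates, number of Toffoli gates, l_0').
    """
    l0 = next(l for l in range(2, k + 1) if l not in controls)
    cnot = toffoli = 0
    for l in range(l0 + 1, k + 1):
        if l in controls:
            continue            # identity, nothing to implement
        if l - 1 in controls:
            toffoli += 1        # controls q_l, q_{l-1}; target q_{l-2}
        else:
            cnot += 1           # control q_l; target q_{l-1}
    # Each product appears twice, once on each side of the central gate.
    return 2 * cnot, 2 * toffoli, l0 - 2


def closed_form(n, k, l0_prime):
    """Closed-form (CNOT, Toffoli) counts quoted in the text."""
    return 2 * k - 4 * n + 2 * l0_prime, 2 * (n - l0_prime - 1)


C1 = {1, 3, 6, 8, 11, 13, 15, 17}          # over Q_1 = Q, k_1 = 19
C2 = {1, 2, 6, 10}                          # {9, 10, 14, 18} relabelled, k_2 = 11

for name, controls, k in (("C_1", C1, 19), ("C_2", C2, 11)):
    cnot, toffoli, l0_prime = count_xbar_gates(controls, k)
    print(name, (cnot, toffoli), closed_form(len(controls), k, l0_prime))
```

For $C_1$ the central gate is a CZ ($l'_0 = 0$), while for $C_2$ the two neighboring controls at the top lead to a CCZ ($l'_0 = 1$).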

We now wish to decompose the structure in terms of basic Clifford+T gates, under the condition that CNOT gates can only apply on nearest neighboring qubits. All of the CNOT and CZ gates applied in the structure are already applied on nearest neighbors. All  $[X_{\Delta}]$  gates are of the same form, with controls  $q_l, q_{l-1}$  and target  $q_{l-2}$ . Each such gate can be implemented as the following LNN restricted circuit:

[Circuit (19): LNN-restricted implementation of $[X_{\Delta}]$ with controls $q_l, q_{l-1}$ and target $q_{l-2}$.]

As in the ATA case (Appendix A), the gates on the left of the barrier can commute with other gates in the structure, and allow for a reduced depth.

In case a CZ gate is used in the circuit, it is simply replaced by a CNOT and two H gates. In case a [iZ] gate is used, it can be replaced by a [iZ] with an additional SWAP gate (which is cancelled out in the full MCSU2 structure) operating on the controls. This gate can be implemented as follows.

[Circuit: LNN implementation of the [iZ] gate with an additional SWAP gate on the controls, acting on $q_{l_0-2}$, $q_{l_0-1}$ and $q_{l_0}$.]

Using the $l'_0$ notation, the gate $[Z]_{q_{l_0}}^{C^{l_0}}$ can be replaced with two H gates and one CNOT gate for $l'_0 = 0$, or with a structure of CNOT cost and depth six along with four T gates in depth two for $l'_0 = 1$. Therefore, the CNOT cost and depth are $5l'_0 + 1$, the T cost is $4l'_0$ with depth $2l'_0$, and the H cost is $2(1 - l'_0)$. The H gates can be applied on the top qubit and do not contribute to the depth. Lemma 15 follows.

**Lemma 15.** A $\{\overline{Z}\}_{C,Q}^k$ gate with $n \geq 3$ and k > n such that only the first two control qubits may be neighboring, and $q_k \notin C$ can be implemented up to a SWAP-CS gate on two neighboring controls using

CNOT cost $2k + 6n - 3l'_0 - 9$ and depth $2k + 2n + l'_0 - 1$,

T cost $8n - 4l'_0 - 8$ and depth $2n$,

H cost $4n - 6l'_0 - 2$ and depth $2n - 2l'_0$.

This structure can be used in order to efficiently decompose any multi-controlled SU(2) operator using the LNN restricted version of Lemma 5, as discussed in Section 3.1.1. The only assumption is that the target qubit is located at the bottom of the circuit. Each of the large MCZ gates followed by a relative phase gate can be implemented as  $\{\overline{Z}\}_{C_1,Q_1}^{k_1}$  or  $\{\overline{Z}\}_{C_2,Q_2}^{k_2}$  with or without a CS and SWAP gate applied on two neighboring control qubits.

If applied on the structures corresponding to  $C_1$ , the SWAP gates are simply cancelled out since the first two qubits are unaffected by the structure corresponding to  $C_2$ . If applied on the structures corresponding to  $C_2$ , the SWAP gates must commute with the  $[\Delta_1^{\dagger}]$  gate in order to cancel out. From the definition of  $\{\overline{Z}\}_{C_1,Q_1}^{k_1}$ , it is clear that swapping any two neighboring non-control qubits has no effect, and thus the SWAP gates cancel out in this case as well.

Each cost and depth described in Lemma 15 can be written as  $a_kk + a_nn + b_ll'_0 + b$ . The total costs and depths of the full MCSU2 structure can therefore be written as  $2a_k(k_1 + k_2) + 2a_nn + 2b_l(l'_{0,1} + l'_{0,2}) + 4b$ . Noting that  $Q[1:2] \notin C_2$  for  $l'_{0,1} = 1$ , and  $Q[1:3] \notin C_2$  for  $l'_{0,1} = 0$ , we can write  $k_2 \leq k_1 - 3 + l'_{0,1}$  and  $k_1 = k$ . So, the costs and depths in the worst case can be written as

$$4a_kk + 2a_nn + 2a_k(l'_{0,1} - 3) + 2b_l(l'_{0,1} + l'_{0,2}) + 4b.$$

The depth of H gates is reduced by three due to paralleled gates in the full structure, and in addition, if  $l'_{0,1} = 0$ , two H gates can be cancelled out. By choosing the values of  $l'_{0,1}$  and  $l'_{0,2}$  which maximize each cost and depth we achieve the following upper bounds.

**Lemma 16.** Any multi-controlled SU(2) gate with $n \geq 6$ controls, can be implemented over k > n qubits in LNN connectivity such that the target qubit is located at the bottom, using 8 gates from $\{R_{\hat{x}}, R_{\hat{z}}\}$, in addition to no more than

CNOT cost $8k + 12n - 48$ and depth $8k + 4n - 8$,

T cost $16n - 32$ and depth $4n$,

H cost $8n - 10$ and depth $4n - 3$.

Finally, in order to account for any possible qubit permutation, we discuss the added cost in case $t \neq q_k$. Applying a SWAP chain before and after the MCSU2 gate allows the target qubit to be relocated. Noting that the LNN requirements hold if the order of Q is reversed, the structure can be used if t is either above or below all qubits in C. The maximal distance of t from the nearest edge is $\lfloor \frac{k-1}{2} \rfloor$ qubits, and therefore the maximal number of required SWAP gates is $2\lfloor \frac{k-1}{2} \rfloor$, each equivalent to three CNOT gates. We show in Appendix D that the maximal CNOT cost and depth of these SWAP gates can be reduced to $4\lfloor \frac{k-1}{2} \rfloor$ and $2\lfloor \frac{k-1}{2} \rfloor + 4$ respectively. The following therefore follows from Lemma 16.

**Theorem 6.** Any multi-controlled SU(2) gate with $n \ge 6$ controls, can be implemented over k > n qubits in LNN connectivity with any qubit permutation using 8 gates from $\{R_{\hat{x}}, R_{\hat{z}}\}$, in addition to no more than

CNOT cost $10k + 12n - 50$ and depth $9k + 4n - 5$,

T cost $16n - 32$ and depth $4n$,

H cost $8n - 10$ and depth $4n - 3$.
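The passage from Lemma 15 to these bounds is a small maximisation over $l'_{0,1}, l'_{0,2} \in \{0,1\}$, which can be sketched as follows. The coefficient table is transcribed from Lemma 15 in the form $a_kk + a_nn + b_ll'_0 + b$; the H-gate adjustments and the SWAP-chain overhead follow our reading of the surrounding text, and the names `LEMMA_15`, `full_structure`, `worst_case` and `with_target_relocation` are hypothetical.

```python
from itertools import product

# (a_k, a_n, b_l, b) for each quantity of Lemma 15.
LEMMA_15 = {
    "cnot_cost": (2, 6, -3, -9),
    "cnot_depth": (2, 2, 1, -1),
    "t_cost": (0, 8, -4, -8),
    "t_depth": (0, 2, 0, 0),
    "h_cost": (0, 4, -6, -2),
    "h_depth": (0, 2, -2, 0),
}


def full_structure(name, k, n, l1, l2):
    """Worst-case expression for the full MCSU2 structure (k_2 at its maximum)."""
    a_k, a_n, b_l, b = LEMMA_15[name]
    value = 4 * a_k * k + 2 * a_n * n + 2 * a_k * (l1 - 3) + 2 * b_l * (l1 + l2) + 4 * b
    if name == "h_cost" and l1 == 0:
        value -= 2          # two H gates cancel when l'_{0,1} = 0
    if name == "h_depth":
        value -= 3          # H gates executed in parallel in the full structure
    return value


def worst_case(name, k, n):
    """Maximise over the four choices of (l'_{0,1}, l'_{0,2})."""
    return max(full_structure(name, k, n, l1, l2) for l1, l2 in product((0, 1), repeat=2))


def with_target_relocation(k, n):
    """Add the SWAP-chain CNOT overhead used when t is not at the bottom."""
    d = (k - 1) // 2
    return worst_case("cnot_cost", k, n) + 4 * d, worst_case("cnot_depth", k, n) + 2 * d + 4


k, n = 21, 10
print({name: worst_case(name, k, n) for name in LEMMA_15})
print(with_target_relocation(k, n), (10 * k + 12 * n - 50, 9 * k + 4 * n - 5))
```

For odd $k$ the relocated CNOT cost and depth coincide with the expressions of Theorem 6; for even $k$ they are smaller by two and one respectively.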

We recall that in the ATA case, additional ancilla qubits are a useful resource which may assist in reducing gate counts (Appendix C). In the LNN case, ancilla qubits may be seen as an obstacle, as these increase the value of k and therefore the number of CNOT gates as shown above. However, in certain cases, the ancilla qubits can be used in order to reduce the gate count in LNN connectivity. We discuss this, along with further cost reductions for specific cases in Appendix E.

#### 3.1.3 Multi-controlled multi-target SU(2)

We provide an implementation for any LNN restricted MCMTSU2 gate with a set  $C = \{c_1, c_2, ..., c_n\}$  of n controls and  $\tau = \{t_1, t_2, ..., t_m\}$  of m targets implemented over a set  $Q' = \{q_1, q_2, ..., q_{k'}\}$  of LNN qubits. The set  $Q \in Q'$  of size k is defined in this case as the smallest set of LNN qubits which satisfies  $C \cup \tau \in Q$  and  $q_k \notin C$ .

At first we assume that each target qubit in  $\tau$  is located below all qubits in C, i.e.  $C^j = C$  for any  $q_j \in \tau$ . In this case, recalling the ATA MCMTSU2 case in Lemma 10, the structure provided for a single target LNN MCSU2 can be directly used, along with 8(m-1) additional single qubit rotations from  $\{R_{\hat{x}}, R_{\hat{z}}\}$  in order to implement the LNN MCMTSU2. This is achieved since  $\{\overline{Z}\}_{C_1,Q_1}^{k_1}$  and  $\{\overline{Z}\}_{C_2,Q_2}^{k_2}$  include  $\prod_{t\in\tau}[Z]_t^{C_1}$  and  $\prod_{t\in\tau}[Z]_t^{C_2}$  respectively, and no other MCZ gate is applied on qubits from  $\tau$ .

However, in order to guarantee that all target qubits are at the bottom of the circuit, many SWAP chains must be applied, and for an arbitrary qubit permutation this may result in a quadratic overhead of O(mk) CNOT gates. In order to maintain a linear (O(k)) cost for each basic gate type, we provide a method which can be applied for any permutation in which $q_k \notin C$. In case $q_k \in C$, two partial SWAP chains can be used in order to relocate one qubit as before. Since in this case any non-control qubit can be chosen, the maximal distance from the nearest edge is $\lfloor \frac{n}{2} \rfloor$ qubits. The CNOT cost overhead in the worst case is $4\lfloor \frac{n}{2} \rfloor$ in depth $2\lfloor \frac{n}{2} \rfloor + 4$, as shown in Appendix D.

The macro structure in this case is similar to Circuit (8), with  $[\Delta]$  gates applied on subsets of  $Q \setminus \tau$ , and the subsets  $C_1$  and  $C_2$  are defined similarly to the LNN MCSU2 case with one additional condition - for each of the subsets  $C_1, C_2$  it must hold that any consecutive subset of target qubits from  $\tau$  may not have two neighboring control qubits.

A similar procedure to the one defined for MCSU2 can be used while treating the qubit above and below consecutive targets as nearest neighbors. In addition, in order to simplify this case, we are not allowing for any nearest neighboring controls in either subset, which in turn removes the [iZ] gate from this structure.

We start with $C_1 = \emptyset$, $C_2 = \emptyset$, the iteration index $l = 1$, and an additional index $l' = 1$. At each iteration, $q_l$ is added to $C_1$ if

$$q_l \in C \land q_{l'} \not\in C_1,$$

where, for $l > 1$, $q_{l'}$ is the previous non-target qubit. If this condition is not met and $q_l \in C$, the qubit $q_l$ is added to $C_2$. If $q_l \notin \tau$, the index $l'$ is then set to the value of $l$. Finally, $l$ is increased by 1 for the next iteration, stopping when $C_1 \cup C_2$ holds all of the control qubits in C, i.e. $n_1 + n_2 = n$.
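As an illustration, the multi-target procedure can be sketched in a few lines of Python. The 10-qubit layout used below is an arbitrary choice of ours and does not correspond to the example circuit of the paper; the function names are hypothetical and indices are 1-based.

```python
def split_controls_multi_target(k, controls, targets):
    """Split the controls into C_1 and C_2 for the LNN MCMTSU2 structure."""
    c1, c2 = [], []
    prev = 1                        # index l' of the previous non-target qubit
    for l in range(1, k + 1):
        if l in controls:
            if prev not in c1:
                c1.append(l)
            else:
                c2.append(l)
        if l not in targets:
            prev = l
    return c1, c2


def has_adjacent_controls(subset, k, targets):
    """True if two members of subset are neighbors once targets are skipped."""
    non_targets = [l for l in range(1, k + 1) if l not in targets]
    position = {l: i for i, l in enumerate(non_targets)}
    ordered = sorted(position[l] for l in subset)
    return any(b - a == 1 for a, b in zip(ordered, ordered[1:]))


k = 10
controls = {1, 2, 4, 5, 7, 8}
targets = {3, 6, 10}
c1, c2 = split_controls_multi_target(k, controls, targets)
print(c1, c2)
print(has_adjacent_controls(c1, k, targets), has_adjacent_controls(c2, k, targets))
```

In this layout $q_2$ and $q_4$ are separated only by the target $q_3$, so they are treated as neighbors and end up in different subsets.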

The following circuit provides an example of the macro structure achieved for an arbitrary choice of  $C, \tau$  and Q.



We now wish to decompose the  $[\Delta_1]_{Q_1\setminus \tau}\prod_{j=1}^m[Z]_{t_j}^{C_1}$  and  $[\Delta_2]_{Q_2\setminus \tau}\prod_{j=1}^m[Z]_{t_j}^{C_2}$  gates. We use the  $\{\overline{Z}\}_{C_1,Q_1}^{k_1}$  and  $\{\overline{Z}\}_{C_2,Q_2}^{k_2}$  decompositions as in Lemma 14, and apply CNOT transformations in two steps.

First we remove any $[Z]_{q_j}^{C_1^j}$ and $[Z]_{q_j}^{C_2^j}$ for $q_j \in \tau$ if $C_1^j \neq C_1$ or $C_2^j \neq C_2$ respectively. This can be achieved due to our choice of $C_1$ and $C_2$ such that sequential target qubits cannot have more than one neighboring control qubit in either control subset.

In case the qubit below the target sequence is not a control, we can use the following.

[Circuit: identity used to remove the MCZ gates from a sequence of target qubits when the qubit below the sequence is not a control; wires $C$ ($n$ qubits), $a$ and $t$.]

And if the qubit below is a control (and the qubit above is not a control), we use the following.

In our example,  $\{\overline{Z}\}_{C_1,Q_1}^{k_1} = [\Delta_1']_{Q_1 \backslash t_m}[Z]_{t_m}^{C_1}$  is converted to  $[\Delta_1]_{Q_1 \backslash \tau}[Z]_{t_m}^{C_1}[Z]_{t_{m-1}}^{C_1}$  as follows.



This process adds two CNOT gates per target which we "disconnect", and therefore increases the CNOT cost and depth by no more than 2m.

Once these target qubits are not affected by any gate, we wish to add $[Z]_{q_j}^{C_1}$ or $[Z]_{q_j}^{C_2}$ gates for each $q_j \in \tau$ qubit which is located above the bottom control qubit. Using the identity described in Circuit (9), a pair of CNOT gates can be applied on both sides of the structure in order to "add a target" to the MCMTZ gate. For each added target, we utilize the nearest qubit on which an MCZ gate controlled by the entire control subset is already applied. The first pair of CNOT gates will therefore be applied on the qubit neighboring below the bottom control, and the nearest target qubit above it. The second pair will be applied on this added target and the nearest one above it and so on. The following demonstrates this on the structure controlled by $C_1$ in our example.



This requires long range CNOT gates which can be achieved using SWAP gates. However, in order to lower the cost, we can allow for additional CNOT gates to be implemented, if those only result in relative phase gates applied on  $Q \setminus \tau$ .

If a pair of CNOT gates controlled by a qubit from $Q \setminus \{C_1, \tau\}$ (the same holds for $C_2$) is applied on a qubit from $Q \setminus C_1$, this effectively adds an MCZ gate controlled by $C_1$, and applied on the $Q \setminus \{C_1, \tau\}$ qubit, and therefore it can be included in the $[\Delta]$ gate. Moreover, if a CNOT gate controlled by a qubit from $C_1$ is applied on a qubit from $Q \setminus C_1$, this effectively adds an MCZ gate which is only applied on $C_1$, and thus can be included in the $[\Delta]$ gate as well. The following can therefore be used to replace long range CNOT gates in this case.



Defining d as the distance from the control qubit to the target qubit of the long-range CNOT gate, the cost and depth of this structure is 2d-1. In an arbitrary permutation, this structure can be applied up to 2m times in the implementation of each MCMTZ gate, at a total cost and depth of $2\sum_{j=1}^{m}(2d_j-1)$, with $d_j$ as the distance of each long range CNOT pair, such that $\sum_{j=1}^{m}(d_j) \leq k$. The added cost and depth of CNOT gates for this step is therefore no larger than 4k-2m.
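As a worked step, the bound follows in one line from the constraint on the distances:

$$2\sum_{j=1}^{m}(2d_j - 1) = 4\sum_{j=1}^{m} d_j - 2m \leq 4k - 2m.$$

Combined with the at most $2m$ CNOT gates of the "disconnecting" step, each MCMTZ gate gains at most $(4k - 2m) + 2m = 4k$ CNOT gates in cost and depth. Assuming the four MCMTZ structures of the macro circuit (two controlled by $C_1$ and two by $C_2$), this amounts to at most $16k$ in total.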

Adding all of these steps together, the MCMTSU2 costs and depths are equivalent to the MCSU2 ones, with  $l'_{0,1} = l'_{0,2} = 0$ , such that CNOT cost and depth is increased by no more than 16k, and 8(m-1) gates from  $\{R_{\hat{x}}, R_{\hat{z}}\}$  are added. Since neighboring controls are not allowed in  $C_1$ , we can only guarantee  $q_1 \notin C_2$ , and we can write  $k_2 \leq k-1$ . Adding the CNOT overhead of  $4\lfloor \frac{n}{2} \rfloor$  in depth  $2\lfloor \frac{n}{2} \rfloor + 4$ , the following is achieved similarly to Theorem 6.

**Theorem 7.** Any multi-controlled multi-target SU(2) gate with $n \geq 6$ controls and $m \geq 1$ targets, can be implemented over $k \geq n + m$ qubits in any permutation in LNN connectivity using 8m gates from $\{R_{\hat{x}}, R_{\hat{z}}\}$, in addition to no more than

CNOT cost $24k + 14n - 40$ and depth $24k + 5n - 4$,

T cost $16n - 32$ and depth $4n$,

H cost $8n - 8$ and depth $4n$.
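The CNOT bounds can be assembled as follows. This is a sketch based on our reading of the preceding steps: it takes the Lemma 15 CNOT cost $2k + 6n - 3l'_0 - 9$ and depth $2k + 2n + l'_0 - 1$ with $l'_{0,1} = l'_{0,2} = 0$, sets $k_1 = k$ and $k_2 \leq k - 1$, adds the overhead of $16k$, and uses $\lfloor \frac{n}{2} \rfloor \leq \frac{n}{2}$ for the relocation overhead.

$$\underbrace{2 \cdot 2(k_1 + k_2) + 2 \cdot 6n + 4 \cdot (-9)}_{\leq\, 8k + 12n - 40} + 16k + 4\left\lfloor \tfrac{n}{2} \right\rfloor \leq 24k + 14n - 40,$$

$$\underbrace{2 \cdot 2(k_1 + k_2) + 2 \cdot 2n + 4 \cdot (-1)}_{\leq\, 8k + 4n - 8} + 16k + 2\left\lfloor \tfrac{n}{2} \right\rfloor + 4 \leq 24k + 5n - 4.$$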

We note that some depth reductions are achieved in many cases, considering the orientation of sequential "V"-shaped structures such as Circuit (26) and Circuit (17). The structure in Circuit (27) can be used instead of Circuit (26) in some cases in order to reduce the CNOT depth, while increasing the count of H gates.

### 3.2 Multi-controlled X

#### 3.2.1 MCX

Using the construction described in Circuit (11) for the case of LNN restricted architecture, we assume that a dirty ancilla qubit a is located below all qubits in  $\{C, t\}$  and use Lemma 16 in order to apply an  $[R_{\hat{y}}(2\pi)]_a^{\{C,t\}} = [Z]_t^C$  gate with  $\{R_{\hat{x}}, R_{\hat{z}}\}$  gates replaced by four H gates. The location of the target does not matter, as it can be treated as one of the control qubits of the MCZ gate.

The total CNOT cost and depth added in order to locate the ancilla qubit at the bottom is no larger than  $4\lfloor \frac{n+1}{2} \rfloor$  and  $2\lfloor \frac{n+1}{2} \rfloor + 4$  respectively, similarly to the MCSU2 case, while now the ancilla closest to one of the edges can be labelled as a, and therefore, the largest distance is  $\lfloor \frac{n+1}{2} \rfloor$ .

All other costs and depths are achieved from the MCSU2 case as in Lemma 16 with n replaced by n + 1. Finally, two Hadamard gates are applied on the MCX target qubit on both sides of the structure. This provides the following.

**Theorem 8.** Any multi-controlled X gate with $n \geq 5$ controls, can be implemented over $k \geq n+2$ qubits in any permutation in LNN connectivity using no more than

CNOT cost $8k + 14n - 34$ and depth $8k + 5n + 1$,

T cost $16n - 16$ and depth $4n + 4$,

H cost $8n + 4$ and depth $4n + 3$.

#### 3.2.2 MCMTX

Similarly to the ATA case, if the number of targets is larger than one, an ancilla qubit is not required. We assume that a non-control qubit is located at the bottom and treat it as an ancilla qubit in order to implement an MCZ gate applied on the set $\{C, q_j\}$ such that $q_j \in \tau$ is the lowest target qubit which satisfies j < k. The MCZ gate is implemented using Theorem 8. Each additional target qubit is added similarly to the ATA case by applying a CNOT gate on each side. In the LNN case we add a new target using the closest non-control qubit to it on which an MCZ gate controlled by C is already implemented. In this case, the long range CNOT gates can be implemented up to a relative phase as follows.



Finally, a pair of H gates is applied on each target qubit in order to transform the MCMTZ to MCMTX. Each long distance CNOT gate is implemented using four H gates, however, many of these gates cancel out, resulting in an addition of no more than 2m-2 H gates. Noticing the similarities to the process described in Section 3.1.3, it can be realised that in addition to the costs described in Theorem 8, this process requires no more than 4k-2m CNOT gates.

**Theorem 9.** Any multi-controlled multi-target X gate with $n \ge 5$ controls and $m \ge 2$ targets, can be implemented over $k \ge n + m$ qubits in any permutation in LNN connectivity using no more than

CNOT cost $12k + 14n - 2m - 34$ and depth $12k + 5n - 2m + 1$,

T cost $16n - 16$ and depth $4n + 4$,

H cost $8n + 2m + 2$ and depth $4n + 2m + 1$.

### 3.3 Multi-controlled U(2)

#### 3.3.1 MCMTU2

Similarly to the ATA case, we can use the MCMTSU2 structure in order to implement an MCMTU2 gate, by using the clean ancilla as one of the targets and applying a multi-controlled rotation about the $\hat{z}$ axis on it in order to account for the overall required phase. The MCMTU2 gate can therefore be implemented using Theorem 7, with the adjustments mentioned in Theorem 5.

We note that when the number of targets is small, it is beneficial to apply chains of partial SWAP gates in order to place all of the targets at the bottom.

An MCU2 gate requires one target and one clean ancilla qubit to be located at the bottom. In the worst case, the sum of the distances of these qubits from one of the edges is no larger than k-2. This can be realized by noting that if $d_1$ and $d_2$ represent the distances of each of the qubits from the bottom edge, then the total number of necessary swaps (pairs of SWAP gates) is $D = d_1 + d_2 - 1$. Similarly, the distance from each of these qubits to the top edge is k-d-1, and the number of swaps will be D' = 2k - D - 4. As we can choose which edge to use, the required number of swaps is no larger than min(D', D). The worst case is achieved when D = D' and therefore, D = k - 2. We note that if the qubit which is nearest to the chosen edge is relocated first, partial SWAP gates can be used in this case, as can be realised from Appendix D. The total number of CNOT gates required in order to place the target and the ancilla qubits at the bottom is no larger than 4k - 8, in depth 2k + 8. The following can be achieved using Lemma 16.

**Theorem 10.** Any multi-controlled U(2) gate with $n \geq 6$ controls can be implemented over k > n+1 qubits with one ancilla qubit in state $|0\rangle$ in any permutation in LNN connectivity using 11 gates from $\{R_{\hat{x}}, R_{\hat{z}}\}$, in addition to no more than

CNOT cost $12k + 12n - 56$ and depth $10k + 4n$,

T cost $16n - 32$ and depth $4n$,

H cost $8n - 8$ and depth $4n - 1$.
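The swap-count argument used for relocating the target and the ancilla can be checked by brute force over all pairs of positions. The sketch below follows the definitions of $D$ and $D'$ given before Theorem 10; the function names are hypothetical, and the factor of four CNOT gates per swap pair (two partial SWAP gates of two CNOT gates each) is our reading of the stated $4k - 8$ total.

```python
def swap_pairs(k, d1, d2):
    """Swap pairs needed to bring two qubits to the cheaper edge.

    d1 and d2 are the distinct distances (0..k-1) of the two qubits
    from the bottom edge of a k-qubit LNN register.
    """
    to_bottom = d1 + d2 - 1                        # D
    to_top = (k - d1 - 1) + (k - d2 - 1) - 1       # D' = 2k - D - 4
    return min(to_bottom, to_top)


def worst_case(k):
    """Largest number of swap pairs over all placements of the two qubits."""
    return max(swap_pairs(k, d1, d2)
               for d1 in range(k) for d2 in range(k) if d1 != d2)


for k in (8, 9, 16):
    pairs = worst_case(k)
    print(k, pairs, 4 * pairs)      # 4 * pairs is compared against 4k - 8
```

The maximum is reached, for example, when one qubit already sits at the bottom edge and the other at the top edge.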

Our upper bounds for LNN multi-controlled single target gates over any qubit permutation are summarized in Table 2. The leading terms of the cost and depth of the CNOT and T gates are presented.

| Gate type | Ancilla | CNOT cost | CNOT depth | T cost | T depth |
|-------|-----------|-----------|----------|------|-------|
| MCSU2 | None      | 10k + 12n | 9k + 4n  | 16n  | 4n    |
| MCX   | One dirty | 8k + 14n  | 8k + 5n  | 16n  | 4n    |
| MCU2  | One clean | 12k + 12n | 10k + 4n | 16n  | 4n    |

Table 2: A summary of our LNN results.

Previously known methods, such as the MCX decomposition described in [13], initially assume a specific qubit ordering. The gates used in this implementation are from the NCV library. By converting their results to fault-tolerant gates, these could be compared to Theorem 14 and Theorem 15 in Appendix E. When implemented over a large circuit with arbitrary qubit permutation, the CNOT cost due to communication overhead, which is required to reorder the qubits, scales quadratically as O(nk). In contrast, our methods maintain a linear O(k) scaling over any qubit ordering, even for the multi-controlled multi-target versions of our implementations.

## 4 Conclusion

In this paper, we presented new methods for the decomposition of various types of multi-controlled gates. We have focused on the implementation of MCSU2, MCX and MCU2, along with their multi-target versions, in both ATA and LNN connectivities. The MCSU2 gate is implemented without ancilla, while MCX and MCU2 use one dirty ancilla and one clean ancilla, respectively. The underlying techniques used to decompose these gates are closely related.

The decomposition of each type of gate requires a significantly lower number of basic Clifford+T gates, compared to previous methods which use the same ancilla requirements. In the ATA case, our cost reductions for the CNOT and T gates are 25% - 62.5%, and up to 50%, respectively. More strikingly, for our LNN implementations, the cost of each type of gate scales linearly with regard to the number of qubits in the circuit, regardless of the qubit ordering. To our knowledge, previous methods require a CNOT cost which scales quadratically for arbitrary qubit ordering. Moreover, we have shown in Appendix C and Appendix E that our results can be further improved in some cases if additional ancilla qubits are available.

As many quantum algorithms are designed using multi-controlled gates, our results directly improve the implementations of these algorithms in terms of fewer basic gates, which in turn will provide a significant reduction in error rates.

## Acknowledgements

This work was supported by the Engineering and Physical Sciences Research Council [grant numbers EP/R513143/1, EP/T517793/1].

## Competing Interests

BZ and SB declare a relevant patent application: United Kingdom Patent Application No. 2400486.3

## References

- [1] M. A. Nielsen and I. L. Chuang, "Quantum Computation and Quantum Information: 10th Anniversary Edition," Dec. 2010, ISBN: 9780511976667, publisher: Cambridge University Press. [Online]. Available: https://www.cambridge.org/highereducation/books/quantum-computation-and-quantum-information/01E10196D0A682A6AEFFEA52D53BE9AE
- [2] V. V. Shende, S. S. Bullock, and I. L. Markov, "Synthesis of quantum logic circuits," in *Proceedings of the 2005 Asia and South Pacific Design Automation Conference*, ser. ASP-DAC '05. New York, NY, USA: Association for Computing Machinery, Jan. 2005, pp. 272–275. [Online]. Available: https://dl.acm.org/doi/10.1145/1120725.1120847
- [3] J. M. Arrazola, O. D. Matteo, N. Quesada, S. Jahangiri, A. Delgado, and N. Killoran, "Universal quantum circuits for quantum chemistry," *Quantum*, vol. 6, p. 742, Jun. 2022, publisher: Verein zur Förderung des Open Access Publizierens in den Quantenwissenschaften. [Online]. Available: https://quantum-journal.org/papers/q-2022-06-20-742/
- [4] J. H. A. De Carvalho and F. M. D. P. Neto, "Parametrized Constant-Depth Quantum Neuron," IEEE Transactions on Neural Networks and Learning Systems, pp. 1–12, 2024. [Online]. Available: https://ieeexplore.ieee.org/document/10180218/
- [5] R. Iten, R. Colbeck, I. Kukuljan, J. Home, and M. Christandl, "Quantum circuits for isometries," *Physical Review A*, vol. 93, no. 3, p. 032318, Mar. 2016. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevA.93.032318
- [6] E. Malvetti, R. Iten, and R. Colbeck, "Quantum Circuits for Sparse Isometries," Quantum, vol. 5, p. 412, Mar. 2021, publisher: Verein zur Förderung des Open Access Publizierens in den Quantenwissenschaften. [Online]. Available: https://quantum-journal.org/papers/q-2021-03-15-412/
- [7] D. K. Park, F. Petruccione, and J.-K. K. Rhee, "Circuit-Based Quantum Random Access Memory for Classical Data," *Scientific Reports*, vol. 9, no. 1, p. 3949, Mar. 2019. [Online]. Available: https://www.nature.com/articles/s41598-019-40439-3
- [8] L. K. Grover, "Quantum Mechanics Helps in Searching for a Needle in a Haystack," *Physical Review Letters*, vol. 79, no. 2, pp. 325–328, Jul. 1997, publisher: American Physical Society. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevLett.79.325
- [9] A. Tănăsescu, D. Constantinescu, and P. G. Popescu, "Distribution of controlled unitary quantum gates towards factoring large numbers on today's small-register devices," *Scientific Reports*, vol. 12, no. 1, p. 21310, Dec. 2022, publisher: Nature Publishing Group. [Online]. Available: https://www.nature.com/articles/s41598-022-25812-z
- [10] J. D. S. Silva, T. M. D. Azevedo, I. F. Araujo, and A. J. da Silva, "Linear decomposition of approximate multi-controlled single qubit gates," Oct. 2023, arXiv:2310.14974 [quant-ph]. [Online]. Available: http://arxiv.org/abs/2310.14974

- [11] R. Vale, T. M. D. Azevedo, I. C. S. Araújo, I. F. Araujo, and A. J. da Silva, "Decomposition of Multi-controlled Special Unitary Single-Qubit Gates," Feb. 2023, arXiv:2302.06377 [quant-ph]. [Online]. Available: http://arxiv.org/abs/2302.06377
- [12] A. Barenco, C. H. Bennett, R. Cleve, D. P. DiVincenzo, N. Margolus, P. Shor, T. Sleator, J. A. Smolin, and H. Weinfurter, "Elementary gates for quantum computation," *Physical Review A*, vol. 52, no. 5, pp. 3457–3467, Nov. 1995, publisher: American Physical Society. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevA.52.3457
- [13] X. Cheng, Z. Guan, and W. Ding, "Mapping from multiple-control Toffoli circuits to linear nearest neighbor quantum circuits," *Quantum Information Processing*, vol. 17, no. 7, p. 169, May 2018. [Online]. Available: https://doi.org/10.1007/s11128-018-1908-8
- [14] Y. He, M.-X. Luo, E. Zhang, H.-K. Wang, and X.-F. Wang, "Decompositions of n-qubit Toffoli Gates with Linear Circuit Complexity," *International Journal of Theoretical Physics*, vol. 56, no. 7, pp. 2350–2361, Jul. 2017. [Online]. Available: https://doi.org/10.1007/s10773-017-3389-4
- [15] S. Balauca and A. Arusoaie, "Efficient Constructions for Simulating Multi Controlled Quantum Gates," in Computational Science ICCS 2022, ser. Lecture Notes in Computer Science, D. Groen, C. de Mulatier, M. Paszynski, V. V. Krzhizhanovskaya, J. J. Dongarra, and P. M. A. Sloot, Eds. Cham: Springer International Publishing, 2022, pp. 179–194.
- [16] B. Zindorf and S. Bose, "Quantum Computing with Hermitian Gates," Feb. 2024, arXiv:2402.12356 [quant-ph]. [Online]. Available: http://arxiv.org/abs/2402.12356
- [17] D. Maslov, "Advantages of using relative-phase Toffoli gates with an application to multiple control Toffoli optimization," *Physical Review A*, vol. 93, no. 2, p. 022311, Feb. 2016, publisher: American Physical Society. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevA.93.022311
- [18] D. Maslov, G. Dueck, and D. Miller, "Toffoli network synthesis with templates," *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, vol. 24, no. 6, pp. 807–817, Jun. 2005. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/1432873
- [19] D. Maslov and B. Zindorf, "Depth Optimization of CZ, CNOT, and Clifford Circuits," *IEEE Transactions on Quantum Engineering*, vol. 3, pp. 1–8, 2022. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/9792395

## A ATA MCSU2 depth reductions

We provide a description of the depth reductions reported in Theorem 1. The relative phase Toffoli can be expressed as  $[X_{\Delta}]_{q_3}^{\{q_1,q_2\}} = [O_3]_{\{q_1,q_2,q_3\}}[O_2]_{\{q_1,q_3\}}$ , such that  $[O_2]_{\{q_1,q_3\}}$  and  $[O_3]_{\{q_1,q_2,q_3\}}$  are defined as the gates on the left of the barrier and on the right of the barrier in Circuit (6) respectively. The expression in Lemma 8 can be rewritten using these notations, noting that  $\prod_{j=3}^n [X_{\Delta}]_{d_{j-2}}^{\{c_j,d_{j-1}\}} = (\prod_{j=3}^n [O_3]_{\{c_j,d_{j-1},d_{j-2}\}})(\prod_{j=3}^n [O_2]_{\{c_j,d_{j-2}\}})$ . This commutation is achieved because each  $[O_2]_{\{q_1,q_3\}}$  gate is the first to access the qubits  $q_1,q_3$ , as can be seen in Circuit (5). We define the following notations:

$$\Lambda_{C,D}^{O_3} = \left(\prod_{j=3}^n [O_3]_{\{c_j,d_{j-1},d_{j-2}\}}\right)^{\dagger} [iZ]_{d_1}^{\{c_1,c_2\}} \left(\prod_{j=3}^n [O_3]_{\{c_j,d_{j-1},d_{j-2}\}}\right)$$

and

$$\Xi_{C,D}^{O_2} = \prod_{j=3}^{n} [O_2]_{\{c_j,d_{j-2}\}}$$

and we can rewrite Lemma 8 as

**Lemma 17.** 
$$\{Z\}_{C,D}^n[S]_{c_2}^{c_1} = (\Xi_{C,D}^{O_2})^{\dagger} \Lambda_{C,D}^{O_3}(\Xi_{C,D}^{O_2}) \text{ for } n \geq 3, |C| = n, \text{ and } |D| = n - 1.$$

Therefore, any  $\{Z\}_{C,D}^n[S]_{c_2}^{c_1}$  gate can be decomposed using one [iZ] gate, 2(n-2)  $[O_3]_{\{q_1,q_2,q_3\}}$  gates, and 2(n-2)  $[O_2]_{\{q_1,q_3\}}$  gates. Since  $[O_2]_{\{q_1,q_3\}}$  gates can be applied simultaneously in each  $\Xi$  operator, the depth is reduced to one [iZ] gate, 2(n-2)  $[O_3]_{\{q_1,q_2,q_3\}}$  gates and two  $[O_2]_{\{q_1,q_3\}}$  gates, as presented in Lemma 9. We define the following control subsets:

$$C_1 = C[1:n_1]$$

$$C_2 = C[n_1 + 1:n_1 + n_2]$$

$$C'_1 = \begin{cases} C_1[3:n_2] & n_2 = n_1 \\ \{C_1[3:n_1], C_1[2]\} & n_2 \neq n_1 \end{cases}$$

$$C'_2 = C_2[3:n_1]$$

Such that  $n_1 = \lfloor \frac{n}{2} \rfloor$ ,  $n_2 = n - n_1 = \lceil \frac{n}{2} \rceil$ ,  $|C_1'| = n_2 - 2$ , and  $|C_2'| = n_1 - 2$ . We choose  $[\Delta_1]_C = [S]^{C_1[1:2]} \{Z\}_{C_1,C_2'}^{n_1-1}$  and  $[\Delta_2]_C = [S]^{C_2[1:2]} \{Z\}_{C_2,C_1'}^{n_2-1}$  as relative phase gates applied on the set C. For this choice we get

$$[\Delta_1]_C[Z]_t^{C_1} = [S]^{C_1[1:2]} \{Z\}_{C_1,\{C_2',t\}}^{n_1}$$

and

$$[\Delta_2]_C[Z]_t^{C_2} = [S]^{C_2[1:2]} \{Z\}_{C_2,\{C_1',t\}}^{n_2}$$
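The subset definitions above can be sketched in a few lines of Python. This is only an illustration: the function name `control_subsets` and the helper `sl` (a 1-indexed, inclusive slice mirroring the $C[a:b]$ notation) are hypothetical names introduced here, and the control qubits are represented by arbitrary labels.

```python
def sl(seq, a, b):
    """1-indexed inclusive slice, mirroring the C[a:b] notation of the text."""
    return list(seq[a - 1:b])

def control_subsets(C):
    """Return (C1, C2, C1', C2') for an ordered list of control labels C."""
    n = len(C)
    n1, n2 = n // 2, n - n // 2          # floor(n/2), ceil(n/2)
    C1 = sl(C, 1, n1)
    C2 = sl(C, n1 + 1, n1 + n2)
    if n2 == n1:
        C1p = sl(C1, 3, n2)
    else:
        C1p = sl(C1, 3, n1) + [C1[1]]    # append C1[2] (1-indexed) when n is odd
    C2p = sl(C2, 3, n1)
    return C1, C2, C1p, C2p

# Example with seven controls labelled c1..c7.
C1, C2, C1p, C2p = control_subsets([f"c{i}" for i in range(1, 8)])
```

For $n = 7$ this gives $C_1' = \{c_3, c_2\}$ and $C_2' = \{c_6\}$, in agreement with $|C_1'| = n_2 - 2$ and $|C_2'| = n_1 - 2$.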

The inverted versions of these gates are implemented in the same way. These structures can be used in order to decompose Circuit (2). Our choice of the sets  $C_1, C_2, C_1', C_2'$  allows for a slight reduction of T depth, considering the full structure. Since the  $\Xi$  operators commute with single qubit gates applied on the target qubit t, there are three copies of  $(\Xi_{C_1,C_2'}^{O_2})(\Xi_{C_2,C_1'}^{O_2})^{\dagger}$  or its inverse formed in layers between consecutive  $\Lambda$  operators. We note that  $C_2'[j-2] = C_2[j]$ , and  $C_1'[j-2] = C_1[j]$  for  $3 \le j \le n_1$ . Therefore,

$$(\Xi^{O_2}_{C_1,C_2'})(\Xi^{O_2}_{C_2,C_1'})^\dagger = \begin{cases} \prod_{j=3}^{n_1} ([O_2]_{\{C_1[j],C_2[j]\}}[O_2^\dagger]_{\{C_2[j],C_1[j]\}}) & n_2 = n_1 \\ \prod_{j=3}^{n_1} ([O_2]_{\{C_1[j],C_2[j]\}}[O_2^\dagger]_{\{C_2[j],C_1[j]\}})[O_2^\dagger]_{\{C_2[n_2],C_1[2]\}} & n_2 \neq n_1 \end{cases}$$

In both cases there are  $n_1 - 2$  parallel copies of the following circuit:



This allows for a reduction of H depth by 3. In addition, the T depth is reduced by 6 if n is even and by 3 if n is odd.

## B ATA MCSU2 alternative

We provide an alternative implementation of the multi-controlled SU(2) with reduced depth of T gates. This is achieved by using the following decompositions of  $[X_{\Delta}]$  and [iZ].

and

In this case,  $[O_2]_{\{q_1,q_3\}}$  and  $[O_3]_{\{q_1,q_2,q_3\}}$  are defined as the gates on the left of the barrier and on the right of the barrier, respectively, in Circuit (29). Using these gates in Lemma 17 provides the following.

**Lemma 18.** A  $\{Z\}_{C,D}^n[S]_{c_2}^{c_1}$  gate with  $n \geq 3$ , |C| = n, and |D| = n - 1 can be implemented using CNOT cost 8n - 11 and depth 6n - 5,

T cost 8n - 12 and depth 2n,

H cost 4n-8 and depth 2n-2.

When applied as part of Lemma 5, using the sets  $C_1, C_2, C_1', C_2'$  as defined in Appendix A, some reductions in CNOT gate *count* can be achieved as  $(\Xi_{C_1,C_2'}^{O_2})(\Xi_{C_2,C_1'}^{O_2})^{\dagger}$  are now comprised of  $n_1-2$  parallel copies of the following circuit:

These cancellations can be applied in three layers and reduce the CNOT count by  $6(n_1 - 2)$ . In addition, the H depth is reduced by 3, and the CNOT depth is reduced by 6 if n is even, or by 3 if n is odd.

We now show that the CNOT cost can be further reduced by applying cancellations for the first  $n_1 - 2$  [ $X_{\Delta}$ ] gates of the full MCSU2 structure. The following gate is controlled by the state  $|0\rangle$ , and therefore commutes with the MCSU2 gate if its control qubit is in the set C.

[Circuit (32): a $|0\rangle$-controlled gate on qubits $q_1$ and $q_2$ involving $R_{\hat{\mathbf{x}}}(\frac{\pi}{2})$ on $q_2$, shown with equivalent forms; circuit diagram omitted.]

This gate and its inverse can be applied on both sides of the MCSU2 structure with  $q_1 = C_1[j]$  and  $q_2 = C_2[j]$  for each value  $3 \le j \le n_1$ . The two following circuits describe the gates which can replace the first  $n_1 - 2$   $[X_{\Delta}]$  gates, and their counterparts at the end of the MCSU2 structure. Setting  $q_3 = \{C'_2, t\}[j-1]$  and  $q_4 = \{C'_1, t\}[j-1]$  for  $3 \le j \le n_1$ , we get the following circuits.

[Circuit diagrams: replacement circuits on qubits $q_1$, $q_2$, $q_3$ involving $T^{\dagger}$, $R_{\hat{\mathbf{x}}}(\frac{\pi}{2})^{\dagger}$ and $H$ gates; circuit diagrams omitted.]

Each of these circuits is used to replace  $n_1 - 2$  copies of Circuit (29). This allows for a reduction of  $n_1 - 2$  CNOT gates in depth 1, and an increase of  $2(n_1 - 2)$  H gates in depth 1, along with the introduction of  $n_1 - 2$  S gates in depth 1.

The total reduction of CNOT gates is  $7(n_1 - 2)$ , which is equivalent to 3.5n - 14 if n is even, and 3.5n - 17.5 if n is odd. The following theorem is achieved from Lemma 5, Lemma 18 and the above CNOT cancellations.

**Theorem 11.** Any SU(2) operator controlled by $n \geq 6$ qubits can be implemented without ancilla using eight gates from $\{R_{\hat{x}}, R_{\hat{z}}\}$ in addition to

CNOT cost $12.5n - 30$ ($12.5n - 26.5$ for odd $n$) and depth $12n - 27$ ($12n - 24$ for odd $n$),

T cost $16n - 48$ and depth $4n$,

H cost $9n - 36$ ($9n - 37$ for odd $n$) and depth $4n - 10$,

S cost $0.5n - 2$ ($0.5n - 2.5$ for odd $n$) and depth 1.
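The CNOT saving of $7(n_1 - 2)$ quoted before Theorem 11 combines the $6(n_1 - 2)$ cancellations in the three $\Xi$ layers with the further $n_1 - 2$ CNOT gates removed at the edges of the structure. The short check below, with hypothetical function names, confirms that this agrees with the closed forms $3.5n - 14$ (even $n$) and $3.5n - 17.5$ (odd $n$), using $n_1 = \lfloor n/2 \rfloor$.

```python
def cnot_reduction(n):
    """Total CNOT saving: 6(n1 - 2) from the layers plus (n1 - 2) at the edges."""
    n1 = n // 2
    return 6 * (n1 - 2) + (n1 - 2)

def closed_form(n):
    """Closed forms quoted in the text for even and odd n."""
    return 3.5 * n - 14 if n % 2 == 0 else 3.5 * n - 17.5

# The two expressions agree for every n in the range covered by Theorem 11.
agreement = all(cnot_reduction(n) == closed_form(n) for n in range(6, 101))
```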

We note that it is possible to use the structure which produces Theorem 1, and use Circuit (29) to replace only the  $6(n_1 - 2)$  [ $X_{\Delta}$ ] gates which allow CNOT cancellations according to Circuit (31). This approach maintains all of the costs described in Theorem 1, while reducing the T depth by 3n + O(1), and increasing the CNOT depth by the same. This provides a T depth of 5n + O(1), with no increase to the CNOT cost.

The results mentioned in Theorem 11 can be used, along with Theorem 3 and Theorem 5 in order to obtain the reduced T depth implementations of MCX and MCU2 gates.

We summarize the results for multi-controlled single target gates with reduced T depth in Table 3. We use the same comparisons as used in Table 1. The depth reduction for T gates is now between 75% and 87.5%, while maintaining low costs.

| Gate type | Ancilla   | Source    | CNOT cost | CNOT depth | T cost | T depth |
|-----------|-----------|-----------|-----------|------------|--------|---------|
| MCSU2     | None      | [11]      | 20n       | 20n        | 20n    | 20n     |
| MCSU2     | None      | Ours      | 12.5n     | 12n        | 16n    | 4n      |
| MCX       | One dirty | [5]       | 16n       | 16n        | 16n    | 16n     |
| MCX       | One dirty | Ours      | 12.5n     | 12n        | 16n    | 4n      |
| MCU2      | One clean | [12], [5] | 32n       | 32n        | 32n    | 32n     |
| MCU2      | One clean | Ours      | 12.5n     | 12n        | 16n    | 4n      |

Table 3: A summary of our results compared to previous methods.
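The quoted range of T-depth reductions follows from the leading coefficients in Table 3. A minimal sketch, with a hypothetical helper name and considering only the leading terms in $n$:

```python
# Leading coefficients of the T depth (per control qubit), read off Table 3.
previous_t_depth = {"MCSU2": 20, "MCX": 16, "MCU2": 32}
our_t_depth = 4

def reduction_percent(old, new):
    """Relative reduction of the leading coefficient, in percent."""
    return 100.0 * (old - new) / old

reductions = {gate: reduction_percent(old, our_t_depth)
              for gate, old in previous_t_depth.items()}
```

The smallest reduction is obtained for MCX and the largest for MCU2.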

## C ATA MCMTSU2 with dirty ancillae

We provide a method which reduces the cost of each type of Clifford+T gate used in our decompositions if an additional set  $\chi$  of  $n_{\chi} \leq \lfloor \frac{n-6}{2} \rfloor$  dirty ancilla qubits is available. Since any unused qubit in the circuit can be utilized as a dirty ancilla, this reduction can be applied in any scenario in which the multi controlled gate does not operate on all available qubits.

We note that the $[O_2]$ gates which commute out of the structure corresponding to $C_1$ to form $\Xi$ layers, as mentioned in Appendix A, can be cancelled out if they commute with the structure corresponding to $C_2$. Such commutations can be achieved by using available dirty ancilla qubits as part of $C'_2$ as previously defined. Each available dirty ancilla can be used in this way to reduce the cost by four $[O_2]$ gates, until the number of ancilla reaches n + O(1) and all of these gates have been cancelled out. Here we present a method which doubles the number of gates that can be cancelled out for each added ancilla.

The following structure can be used to implement any MCSU2 gate according to Lemma 5, with  $C_1^{\chi} \cup C_1^2 = C_1$ .



The subsets are defined as follows:

$$C_1^\chi = C[1:2n_\chi] \qquad \qquad C_1^2 = C[2n_\chi + 1:2n_\chi + n_1'] \qquad \qquad C_2 = C[2n_\chi + n_1' + 1:n]$$
 
$$C_1' = \begin{cases} C_1^2[3:n_2] & n_2 = n_1' \\ \{C_1^2[3:n_1'], C_1^2[2]\} & n_2 \neq n_1' \end{cases} \qquad \qquad C_2' = \{\chi, C_2[3:n_1']\} \qquad \qquad C_1 = \{C_1^2[1:2], C_1^\chi, C_1^2[3:n_1']\}$$

The set marked as $C_1^{\chi}$ holds the first $2n_{\chi}$ qubits of the control set C. The remaining $n-2n_{\chi}$ control qubits are divided between the sets $C_1^2$ and $C_2$ of size $n'_1 = \lfloor \frac{n-2n_{\chi}}{2} \rfloor$ and $n_2 = \lceil \frac{n-2n_{\chi}}{2} \rceil$ respectively. The set $C_1$ holds all $n_1 = 2n_{\chi} + \lfloor \frac{n-2n_{\chi}}{2} \rfloor$ qubits of sets $C_1^{\chi}$ and $C_1^2$.
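The partition with dirty ancilla qubits can be sketched in the same way as the subsets of Appendix A. The function and key names below are hypothetical, qubits are represented by arbitrary labels, and Python's 0-indexed half-open slices are used in place of the 1-indexed notation of the text.

```python
def subsets_with_ancilla(C, chi):
    """Partition the controls C given a list chi of dirty ancilla labels."""
    n, n_chi = len(C), len(chi)
    assert n >= 6 and 0 <= n_chi <= (n - 6) // 2
    rest = n - 2 * n_chi
    n1p, n2 = rest // 2, rest - rest // 2        # n_1' and n_2
    C1_chi = list(C[:2 * n_chi])                 # first 2*n_chi controls
    C1_2 = list(C[2 * n_chi:2 * n_chi + n1p])
    C2 = list(C[2 * n_chi + n1p:])
    if n2 == n1p:
        C1p = C1_2[2:n2]
    else:
        C1p = C1_2[2:n1p] + [C1_2[1]]
    C2p = list(chi) + C2[2:n1p]
    C1 = C1_2[:2] + C1_chi + C1_2[2:n1p]
    return {"C1_chi": C1_chi, "C1_2": C1_2, "C2": C2,
            "C1p": C1p, "C2p": C2p, "C1": C1}

# Example: ten controls and two dirty ancilla qubits.
example = subsets_with_ancilla([f"c{i}" for i in range(1, 11)], ["a1", "a2"])
```

In this example $|C_1| = 2n_{\chi} + n_1' = 7$, while $C_2'$ holds $n_{\chi} + n_1' - 2 = 3$ qubits, matching the number of controls in $C_1$ once each pair from $C_1^{\chi}$ is counted as a single control.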

The gate  $[\Delta_2]_{\{C_2,C_1^2\}}[Z]_t^{C_2} = [S]^{C_2[1:2]}\{Z\}_{C_2,\{C_1',t\}}^{n_2}$  is decomposed using Lemma 17 as shown in the previous sections. The gate  $[\Delta_1]_{\{C,\chi\}}[Z]_t^{C_1}$  is also decomposed using Lemma 17, while the qubits in  $C_1^\chi$  are arranged as  $n_\chi$  pairs, and each pair is treated as a single control bit. This adjustment is simply achieved by adding a control bit to each  $[X_\Delta]$  gate which is controlled by such a pair. We use the following notations for a three-controlled relative phase Toffoli and its inverse.



Using Lemma 6, we get the following version of Circuit (3).

The following circuit presents the structure used to decompose  $[\Delta_1]_{\{C,\chi\}}[Z]_t^{C_1} = [S]^{C_1[1:2]}\{Z\}_{C_1,\{C_2',t\}}^{n_1}$  according to Lemma 8 while pairs of qubits from  $C_1^{\chi}$  are treated as a single qubit.



In order to decompose this structure in terms of Clifford+T gates, we use the following implementation, as provided in [17].



As can be seen, when this circuit is used in order to decompose Circuit (38), all gates to the left of the barrier can commute with any gate which is applied on the left hand side, and for the inverse gates, a similar set of gates can commute to the right hand side. Moreover, as can be seen from Circuit (35), these gates commute with the structures controlled by $C_2$, and thus gates which commute from the right hand side of the first structure controlled by $C_1$ cancel out with their counterparts from the left of the second structure controlled by $C_1$.

The rest of these commuted gates, which commute to the right and to the left of the full MCSU2 structure, cancel out as well since all of these gates operate on a qubit from $\chi$, either as a single qubit gate, or as a gate controlled by a qubit from C. Therefore, a similar sequence of these gates and their inverse can be applied (analogously to Circuit (48)) on both sides of the MCSU2 gate shown in Circuit (35) without changing the operator, and cancel out the gates which commute towards the edges. The rest of the $[X_{\Delta}]$ gates in Circuit (38) are decomposed in the same way as Circuit (6), creating two $\Xi$ operators which are applied on $C'_2 \cup C_1^2$.

The gate $[\Delta_1]_{\{C,\chi\}}[Z]_t^{C_1}$ can be implemented using one [iZ] gate, $2n'_1-4$ $[X_{\Delta}]$ gates, and $2n_{\chi}$ three-controlled relative phase Toffoli gates. We note that, due to our choice of $C_1, C_2, C'_1, C'_2$, the same depth reductions or CNOT cancellations between pairs of $\Xi$ operators can be achieved $n'_1-2$ times per layer for gates applied on $C'_2 \cup C^2_1$. The total costs and depths of the MCSU2 structure follow directly from Theorem 1, with n replaced by $n-2n_{\chi}$, in addition to the costs and depths of $4n_{\chi}$ reduced three-controlled Toffoli gates, each contributing cost and depth of 4 CNOT, 4 T and 2 H gates. We note that the steps used to produce Theorem 2 can be used in this case as well, as they are agnostic to the decomposition of the large relative phase gates. Theorem 12 follows.

**Theorem 12.** Any multi-controlled multi-target SU(2) with  $n \geq 6$  controls, and  $m \geq 1$  targets can be implemented with  $0 \leq n_{\chi} \leq \lfloor \frac{n-6}{2} \rfloor$  dirty ancilla qubits using 8m gates from  $\{R_{\hat{x}}, R_{\hat{z}}\}$  in depth 8, in addition

For a minimized T depth decomposition, the same approach is used, with  $[X_{\Delta}]$  and [iZ] gates implemented as Circuit (29) and Circuit (30), and the three-controlled relative phase Toffoli implemented as the following circuit.



The gates on the left of the barrier are removed in the same manner as described above, and the CNOT cancellations which provide Theorem 11 can be applied for gates operating on  $C_2' \cup C_1^2$ . The total costs and depths of the MCSU2 structure can be simply achieved from Theorem 11, with n replaced by  $n-2n_{\chi}$ , and in addition to the costs and depths of  $4n_{\chi}$  reduced three-controlled Toffoli gates, each contributing 4 T gates in depth 2, in addition to cost and depth of 6 CNOT and 2 H gates.

**Theorem 13.** Any multi-controlled multi-target SU(2) with  $n \geq 6$  controls, and  $m \geq 1$  targets can be implemented with  $0 \leq n_{\chi} \leq \lfloor \frac{n-6}{2} \rfloor$  dirty ancilla qubits using 8m gates from  $\{R_{\hat{x}}, R_{\hat{z}}\}$  in depth 8, in addition to

CNOT cost $12.5n + 8m - n_{\chi} - 38$ ($+3.5$ for odd $n$) and depth $12n + 8\lceil \log_2(m) \rceil - 27$ ($+3$ for odd $n$),

T cost $16n - 16n_{\chi} - 48$ and depth $4n$,

H cost $9n - 10n_{\chi} - 36$ ($-1$ for odd $n$) and depth $4n - 10$,

S cost $0.5n - n_{\chi} - 2$ ($-0.5$ for odd $n$) and depth 1.
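The gate costs of Theorem 13 can be cross-checked against the recipe given before it: take the costs of Theorem 11 with $n$ replaced by $n - 2n_{\chi}$ and add $4n_{\chi}$ reduced three-controlled Toffoli gates, each contributing 6 CNOT, 4 T and 2 H gates. The sketch below does this for the single-target case $m = 1$ only; the function names are hypothetical, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction as F

def theorem11_costs(n):
    """Gate costs of Theorem 11, including the odd-n corrections."""
    odd = n % 2
    return {"CNOT": F(25, 2) * n - 30 + odd * F(7, 2),
            "T": F(16 * n - 48),
            "H": F(9 * n - 36 - odd),
            "S": F(1, 2) * n - 2 - odd * F(1, 2)}

# Cost of one reduced three-controlled relative phase Toffoli (S gates: none).
TOFFOLI = {"CNOT": 6, "T": 4, "H": 2, "S": 0}

def derived_costs(n, n_chi):
    """Theorem 11 on n - 2*n_chi controls plus 4*n_chi reduced Toffoli gates."""
    base = theorem11_costs(n - 2 * n_chi)
    return {g: base[g] + 4 * n_chi * TOFFOLI[g] for g in base}

def theorem13_costs(n, n_chi, m=1):
    """Gate costs as stated in Theorem 13."""
    odd = n % 2
    return {"CNOT": F(25, 2) * n + 8 * m - n_chi - 38 + odd * F(7, 2),
            "T": F(16 * n - 16 * n_chi - 48),
            "H": F(9 * n - 10 * n_chi - 36 - odd),
            "S": F(1, 2) * n - n_chi - 2 - odd * F(1, 2)}

consistent = all(derived_costs(n, x) == theorem13_costs(n, x)
                 for n in range(6, 60) for x in range((n - 6) // 2 + 1))
```

Since $n - 2n_{\chi}$ has the same parity as $n$, the odd-$n$ corrections carry over unchanged.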

We note that the number of dirty ancilla qubits can be increased up to $\lceil \frac{n-2}{2} \rceil$. This is achieved by allowing the second and fourth MCZ and $[\Delta]$ gates in Circuit (35) to be implemented as [iZ] or CZ gates when $n_2$ is reduced to 2 or 1. Finally, if $n_2 = 0$, these controlled gates are replaced by Z gates, which can commute and cancel out. This final stage might increase the Clifford+T count slightly; however, it reduces the number of single qubit arbitrary rotations, and halves the cost/depth of CNOT gates which are added for m > 1 targets. Note that if this is used, and n is odd, the last control qubit will be added using an $[X_{\Delta}]$ gate with reduced cost, since the gates which can commute outwards cancel out similarly to the three-controlled Toffoli gates.

The results mentioned in Theorem 12 and Theorem 13 can be used, together with Theorem 3 to Theorem 5, in order to obtain the implementations of the MCX, MCMTX, MCU2 and MCMTU2 gates with additional dirty ancilla qubits, noting that for MCMTX with $m>2$, $m-2$ of the targets can be used as dirty ancilla.

We present our CNOT and T costs of MCSU2, MCX and MCU2 as a function of the number of available dirty ancilla qubits in Figure 1. For comparison, we use the known results from Table 1, noting that additional ancilla qubits do not improve the costs in those methods. When around n/2 dirty ancilla qubits are available, our results for MCSU2, MCX and MCU2 are close to those provided by [17] for MCX with a similar number of ancilla. When the number of ancilla is lower, our methods still benefit from this resource, while other methods use at most one of the available ancilla qubits.



Figure 1: The leading coefficient of the CNOT and T costs for a single target, as a function of the number of available dirty ancilla qubits. In our implementations, the leading terms are the same for MCX, MCU2, and MCSU2, and are marked as CX or T cost. Each of the best known methods that we use for comparison (Table 1) provides the same leading term for CNOT and T cost. The results provided for these methods are marked as dots, which are continued as a dashed line if there is no improvement due to additional ancilla. For MCX, the results from [17] are shown for $\frac{n}{2} + O(1)$ ancilla.

## D LNN communication overhead

In Section 3.1.2, SWAP chains are used in order to relocate the target qubit t of an LNN MCSU2. We present the method used to reduce the CNOT cost required to implement the SWAP chains in Theorem 6. Each SWAP gate can be replaced with a partial SWAP that can be implemented using two CNOT gates instead of three. We use the following gate notations.

such that

[Circuit (42): identities for the partial SWAP notation; circuit diagram omitted.]

Recalling from Lemma 5 that  $[R_{\hat{v}}(\lambda)]_t^C = [A_1]_t [A_3^{\dagger}]_t [R_{\hat{x}}(\lambda)]_t^C [A_4]_t$ , the SWAP chains can be applied as follows.



with the final circuit achieved by applying CNOT cancellations between the SWAP chains. CNOT gates are first "extracted" from each side as follows.

The extracted CNOT gates can be removed since each such gate commutes with  $[A_3^{\dagger}]_t[R_{\hat{x}}(\lambda)]_t^C$ , as both operators apply a rotation about the  $\hat{x}$  axis on the bottom qubit. In addition, the CNOT gates are controlled by qubits which are either controls of the MCSU2 gate, or are not affected by it.
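The two facts used in this argument can be checked numerically on two qubits. The sketch below assumes that a partial SWAP is a SWAP with one of its three CNOT gates removed, so that it differs from a full SWAP by a single CNOT; it also checks that a CNOT commutes with any $R_{\hat{x}}$ rotation applied on its target. All names are hypothetical, the basis ordering is $|\text{top}, \text{bottom}\rangle$, and plain Python lists are used instead of a linear algebra library.

```python
import math

def matmul(A, B):
    """Product of two square matrices given as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def kron(A, B):
    """Kronecker product; the first factor acts on the top qubit."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def close(A, B, tol=1e-12):
    return all(abs(a - b) < tol
               for ra, rb in zip(A, B) for a, b in zip(ra, rb))

I2 = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]
P0 = [[1, 0], [0, 0]]   # projector on |0>
P1 = [[0, 0], [0, 1]]   # projector on |1>

CX_TOP = add(kron(P0, I2), kron(P1, X))   # control: top, target: bottom
CX_BOT = add(kron(I2, P0), kron(X, P1))   # control: bottom, target: top
SWAP = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]

def rx_on_bottom(theta):
    """R_x(theta) applied on the bottom qubit only."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return kron(I2, [[c, -1j * s], [-1j * s, c]])

full_swap = matmul(CX_TOP, matmul(CX_BOT, CX_TOP))   # three CNOT gates
partial_swap = matmul(CX_BOT, CX_TOP)                # two CNOT gates

swap_ok = close(full_swap, SWAP)
# The two-CNOT circuit equals a SWAP followed by one extracted CNOT.
extract_ok = close(partial_swap, matmul(CX_TOP, SWAP))
# The extracted CNOT commutes with an x rotation on its target.
R = rx_on_bottom(0.7)
commute_ok = close(matmul(CX_TOP, R), matmul(R, CX_TOP))
```

The angle 0.7 is arbitrary; the commutation holds for any angle because $R_{\hat{x}}$ commutes with $X$.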

Since each partial SWAP is decomposed as two CNOT gates, the maximal number of CNOT gates required for this reordering is 4d, with d as the "distance" of the swap, i.e. the number of qubits below the location of t. The CNOT depth is no larger than 2(d+2), as can be realised from the following.

In the MCSU2 structure, two additional CNOT gates can be cancelled out, noting that the structure between the partial SWAP chains in Circuit (43) must either begin or end with a  $[X]_{q_{k-1}}^{q_k}$  gate. This is since  $q_{k-1}$  cannot be in both  $C_1$  and  $C_2$ .

A similar analysis can be used for multi-controlled multi-target gates in order to place all of the target qubits at the bottom; however, the method described in Section 3.1.3 can be used instead if the number of targets is large. The MCU2 is implemented as an MCMTSU2 with two target qubits. In this case it is beneficial to place the targets below all control qubits, as can be realised from the cost analysis in Section 3.3.1. The SWAP gates can be replaced by partial SWAP gates in this case as well, by first applying the swap chains for the target qubit which is nearest to the bottom as in Circuit (43), and then repeating a similar process for the second qubit. This can be realised from the following example.



The depth of the swap chains varies slightly, depending on the distance between the target qubits. From Circuit (47), it can be noted that the CNOT depth is no larger than 2(k+4).



## E LNN specific cases

We recall that in the ATA case, additional ancilla qubits are a useful resource, which may assist in reducing gate counts. In the LNN case, ancilla qubits may be seen as an obstacle, as these increase the value of k and therefore the number of CNOT gates as shown in Section 3.1.2. However, in certain cases, the ancilla qubits can be used in order to reduce the gate count in LNN connectivity.

As can be seen in our example in Circuit (16), some of the control qubits of  $C_1$  are located above all  $C_2$  qubits. In this case, the gates which commute out of the  $C_1$  structures to form  $\Xi$  layers (the gates on the left of the barrier in Circuit (19)) may be cancelled out, if those commute with the structure between them, which corresponds to  $C_2$ . The identities described in Circuit (22) and Circuit (23) may be used in some cases to increase the number of gates which can be commuted and cancelled out in this way. In addition, similarly to the ATA case, some of the gates commuting towards the edges of the full MCSU2 structure can be removed by applying gates which commute with the MCSU2 on both sides as follows.



This allows for the following cancellations between the added gates, and the ones commuted to the left from Circuit (19) (similarly for the right side of the MCSU2).

Therefore, each "useful" ancilla allows the removal of 8 T gates, 4 CNOT gates and 4 H gates. These cancellations apply when the ancilla qubit is located directly above the control qubit. However, if the ancilla was originally located directly below the control, it can be swapped as follows.



This has two benefits. First, as can be seen, the two remaining CNOT gates in Circuit (49) can be cancelled out, thus reducing the CNOT cost by an additional 4. Moreover, this swap may increase the number of $C_1$ qubits located above all $C_2$ ones, thus allowing for more such cancellations by making more ancilla qubits "useful", and at the same time reduce $k_2$, thus reducing the number of CNOT gates even further.

From this arises an important question: in which cases is it beneficial to apply swap gates? As can be seen, if one wishes to minimize the number of T gates, and is willing to "pay" with a quadratic increase of CNOT gates, then many SWAP chains can be applied in order to choose a qubit ordering which maximizes the number of useful ancilla. Otherwise, one might compare the costs of each possible qubit reordering and select the most efficient result, considering all Clifford+T gates. As this might be expensive in terms of classical computation, we identify some cases in which it is always beneficial to reorder the qubits.

Initially, both  $k_1$  and  $k_2$  may be reduced by relocating the bottom  $\{C, t\}$  qubit upward towards the nearest  $\{C, t\}$  one above it, using two partial SWAP chains. Similarly,  $k_1$  can be reduced by relocating the top qubit downwards, unless this results in an increase of  $k_2$ .

Finally, one can easily identify the first pair of neighboring control qubits which do not include C[1], and the nearest ancilla qubit below this pair. As mentioned above, if the ancilla is directly below this pair, it is always beneficial to apply a swap, and iteratively continue to identify the next neighboring pair of controls. In case the ancilla is not directly below, it is beneficial to relocate it upward if the number of added CNOT gates applied for this relocation is no larger than the number of CNOT gates which this relocation allows to reduce.

We consider the specific control permutation $\Omega := [1,1,1,0,1,0,1,...,0,1,0,0]$ such that the target qubit t is at the end. After one iteration of the mentioned procedure, a SWAP$(q_3,q_4)$ will be applied, and the new order will be $\Omega' = [1,1,0,1,1,0,1,...,0,1,0,0]$, then SWAP$(q_5,q_6)$ will be applied and so on. Finally, the permutation will be $\Omega^{(n-2)} = [1,1,0,1,0,1,0,...,1,0,1,0]$. This case assumes k = 2n - 1, i.e. n - 2 available dirty ancilla qubits, and allows for all $[X_{\Delta}]$ gates to be implemented as the gates to the right of the barrier in Circuit (19), or their inverse.
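The reordering of $\Omega$ can be traced with a short script. This is a simplified sketch of the procedure for this specific ordering only: it scans for a neighboring pair of controls that does not include C[1] with an ancilla directly below it, and swaps the lower control with that ancilla. The function names are hypothetical; entries equal to 1 are controls, entries equal to 0 are dirty ancilla qubits, and the last entry is the target.

```python
def omega(n):
    """Initial ordering over k = 2n - 1 qubits, with the target last."""
    return [1, 1] + [1, 0] * (n - 2) + [0]

def reorder(order):
    """Apply the iterative swaps; return the new ordering and the swaps used."""
    order = list(order)
    target = len(order) - 1          # 0-based index of the target qubit
    swaps = []
    i = 1                            # skip any pair that contains C[1]
    while i + 2 < target:
        if order[i] == 1 and order[i + 1] == 1 and order[i + 2] == 0:
            order[i + 1], order[i + 2] = order[i + 2], order[i + 1]
            swaps.append((i + 2, i + 3))   # 1-based qubit labels
            i += 2
        else:
            i += 1
    return order, swaps

final_order, swaps = reorder(omega(6))
```

For $n = 6$ the swaps are applied on $(q_3,q_4)$, $(q_5,q_6)$, $(q_7,q_8)$ and $(q_9,q_{10})$, i.e. $n - 2$ swaps in total, and every ancilla ends up directly above a control.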

Furthermore, all applied SWAP gates are automatically cancelled during the process. In this case,  $k_2 = n_2 = 0$ , and  $l'_{0,1} = 1$ . The complete structure will require two SWAP-[iZ] gates and 4(n-2) reduced  $[X_{\Delta}]$  gates in addition to five  $\{R_{\hat{x}}, R_{\hat{z}}\}$  gates. We note that since  $n_2 = 0$ , in this case the following applies for  $n \geq 3$ .

**Theorem 14.** Any multi-controlled SU(2) gate with $n \geq 3$ controls can be implemented over k = 2n - 1 qubits, ordered as $\Omega$ or its reversed string, in LNN connectivity such that the target qubit t is at the top or bottom, using five gates from $\{R_{\hat{x}}, R_{\hat{z}}\}$, in addition to

CNOT cost $12n-12$ and depth $12n-12$,

T cost $8n-8$ and depth $4n-4$,

H cost $4n-8$ and depth $4n-8$.

This can be considered as the best-case scenario; however, slightly different permutations may result in a reduction of some costs/depths while increasing others.

We cover an additional specific case in which all qubits in the circuit are used, i.e. k = n + 1, and the target is located at the top or bottom. For this case, C = [1, 1, 1, ..., 1, 1, 0]. It is clear that  $l'_{0,1} = l'_{0,2} = 1$ ,  $k_1 = k$ , and  $k_2 = k - 2$ . The following result can be achieved by following the steps presented in Section 3.1.2.

**Theorem 15.** Any multi-controlled SU(2) gate with  $n \geq 6$  controls can be implemented over k = n + 1 qubits in LNN connectivity such that the target qubit t is at the top or bottom, using 8 gates from  $\{R_{\hat{x}}, R_{\hat{z}}\}$ , in addition to

CNOT cost $20n-48$ and depth $12n$,

T cost $16n-48$ and depth $4n$,

H cost $8n-32$ and depth $4n-11$.

The costs and depths of MCX and MCU2 with one dirty/clean ancilla can be obtained in a similar way. As the structures used for these gates are based on the MCSU2 implementation, it is clear that in these cases the leading terms are the same as mentioned above.