# GHZ State Preparation with Parallelism
In this example, we will implement a *Greenberger-Horne-Zeilinger* (GHZ) state preparation circuit with $N = 2^n$ qubits. Our goal is to learn how to deploy QASM files with Bloqade that can be interpreted by neutral-atom quantum computers while making use of specific features of this technology. 

First, we will present the standard linear-depth construction of a GHZ state preparation circuit in Bloqade. Following that, we will show a log-depth construction that achieves the same result and build it in Bloqade. This log-depth construction is very convenient for us, as Bloqade (and QuEra's neutral-atom hardware!) supports *parallel* gates, allowing the same gate to be applied across multiple qubits simultaneously. We then take this one step further: combining the log-depth circuit with the arbitrary connectivity enabled by atom *shuttling*, we showcase how to achieve log depth not only in the circuit logic but also in its *execution*. We will tie it all together with an excursion on how to develop optimized compiler passes with `Bloqade` and `Kirin` to automate the workflow.

We begin this excursion with some simple imports from `Bloqade` and `Kirin`:

In [2]:
import math

from bloqade import qasm2
from kirin.dialects import ilist

## Simple Linear Depth Implementation of a GHZ State Preparation Circuit

A simple GHZ state preparation circuit can be built with $N - 1$ CNOT gates and $1$ H gate as follows:

<div align="center">
<picture>
   <img src="GHZ_linear.png" style="width: 35vw; min-width: 330px;" >
</picture>
</div>

Notice that the CNOTs cascade in series, applied to all possible qubits. The number of CNOTs and the depth of the circuit thus grow linearly with the number of qubits $N$.

With Bloqade, we can generate a QASM2 representation of this circuit with the following function:

In [3]:
def ghz_linear(n: int):
    n_qubits = int(2**n)

    @qasm2.extended
    def ghz_linear_program():

        qreg = qasm2.qreg(n_qubits)
        # Apply a Hadamard on the first qubit
        qasm2.h(qreg[0])
        # Create a cascading sequence of CX gates
        # necessary for quantum computers that
        # only have nearest-neighbor connectivity between qubits
        for i in range(1, n_qubits):
            qasm2.cx(qreg[i - 1], qreg[i])

    return ghz_linear_program

This deceptively simple syntax already hides several Bloqade-specific features that will play a big role in the efficient deployment of circuits. Notice the use of the `@qasm2.extended` decorator. It turns the `ghz_linear_program()` function into a kernel, which lets us use high-level control flow in our QASM2 representation of the circuit. This includes the ability to declare `for` loops, as well as the use of the variable `n_qubits` to generate circuits for different problem sizes. _(For the programming savvy: this structure of nested functions that keep access to the local variables of the enclosing function is known as a 'closure'.)_
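For readers less familiar with closures, here is a minimal plain-Python sketch with no Bloqade involved. The helper name `make_linear_pairs` is hypothetical and introduced only for illustration. It shows an inner function reading `n_qubits` from its enclosing scope, just as the kernel above does:

```python
def make_linear_pairs(n: int):
    n_qubits = 2**n  # local variable captured by the inner function

    def linear_pairs():
        # n_qubits is not an argument; it is read from the enclosing scope
        return [(i - 1, i) for i in range(1, n_qubits)]

    return linear_pairs


pairs_for_4_qubits = make_linear_pairs(2)
print(pairs_for_4_qubits())
```

The returned function still "remembers" `n_qubits` after `make_linear_pairs` has finished, and the (control, target) pairs it lists match the CNOT chain of the linear circuit.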

Let's print the QASM2 file corresponding to this circuit for 4 qubits:

In [4]:
from bloqade.qasm2.emit import QASM2 # the QASM2 target
from bloqade.qasm2.parse import pprint # the QASM2 pretty printer

target = QASM2()
ast = target.emit(ghz_linear(2))
pprint(ast)

OPENQASM 2.0;
include "qelib1.inc";
qreg qreg[4];
h qreg[0];
CX qreg[0], qreg[1];
CX qreg[1], qreg[2];
CX qreg[2], qreg[3];


CNOTs are represented by 'CX' in QASM2, while the Hadamard gate is just 'h'. 

At first sight, this does not look exactly like the circuit we wanted. Yet notice that, in the structure of the circuit above, one is growing the GHZ state qubit by qubit. The first qubit is just brought into a $|+ \rangle$ state, which would be our 1-qubit GHZ state. The next CNOT brings us to the 2-qubit GHZ state $\sim |00 \rangle+|11 \rangle$ (also known as a Bell state). Now notice that the qubits already in the GHZ state are equivalent. This means that instead of using $q_0$ as the control for the next CNOT, we could just as well use $q_1$. This symmetry propagates throughout, and we can choose any qubit as a control as long as it has previously been a target. This cascading of controls is what we see in the QASM representation above.
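The equivalence of the two control choices can be checked with a tiny state-vector sketch in plain Python. This is an illustrative toy simulator, not part of Bloqade, and all function names in it are assumptions made for this example. States are stored as a dictionary from bitstrings to amplitudes:

```python
import math


def apply_h(state, q):
    """Apply a Hadamard to qubit q (state: dict bitstring -> amplitude)."""
    s = 1 / math.sqrt(2)
    new = {}
    for bits, amp in state.items():
        b0 = bits[:q] + "0" + bits[q + 1:]
        b1 = bits[:q] + "1" + bits[q + 1:]
        sign = -1.0 if bits[q] == "1" else 1.0  # H|1> = (|0> - |1>)/sqrt(2)
        new[b0] = new.get(b0, 0.0) + s * amp
        new[b1] = new.get(b1, 0.0) + s * sign * amp
    return {b: a for b, a in new.items() if abs(a) > 1e-12}


def apply_cx(state, ctrl, targ):
    """Flip the target bit of every basis state whose control bit is 1."""
    new = {}
    for bits, amp in state.items():
        if bits[ctrl] == "1":
            flipped = "1" if bits[targ] == "0" else "0"
            bits = bits[:targ] + flipped + bits[targ + 1:]
        new[bits] = new.get(bits, 0.0) + amp
    return new


def ghz_state(n_qubits, cascade=True):
    """cascade=True uses q[i-1] as control; cascade=False always uses q[0]."""
    state = apply_h({"0" * n_qubits: 1.0}, 0)
    for i in range(1, n_qubits):
        state = apply_cx(state, i - 1 if cascade else 0, i)
    return state


cascaded = ghz_state(4, cascade=True)
fan_out = ghz_state(4, cascade=False)
for bits in sorted(cascaded):
    print(bits, round(cascaded[bits], 6), round(fan_out[bits], 6))
```

Both orderings leave only `0000` and `1111` with amplitude $1/\sqrt{2}$, which is the 4-qubit GHZ state.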

## Log-depth Implementation of a GHZ State Preparation Circuit

While the circuit above is perfectly functional for the purpose of generating a GHZ state in theory, it is far from optimized. We should strive to make the most of a quantum computer's resources, so let's look at how we can make it better.

From the original protocol (always using $q_0$ as the control), we see that the CNOTs do not have a preferential order. This should not come as a surprise, given that the GHZ state is invariant under qubit permutations. Thus, by first commuting the CNOTs accordingly and then cascading the controls, we can convert the serial structure of cascading CNOTs above into the following circuit of depth $\log_2(N)$ [(see *Mooney, White, Hill, Hollenberg* - 2021)](https://arxiv.org/abs/2101.08946).

<div align="center">
<picture>
   <img src="GHZ_parallel.png" style="width: 25vw; height: 25vw;" >
</picture>
</div>

Naturally, the number of entangling gates did not change. But now, since a CNOT between $q_0$ and $q_1$ commutes with a CNOT between $q_3$ and $q_5$, those gates can appear in parallel in the circuit, and so on. This makes its logical depth shorter, in fact logarithmic in the number of qubits.
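To make the layering concrete, the sketch below lists one possible assignment of CNOTs to layers in plain Python. The specific pairing (each entangled qubit controls a CNOT onto a qubit half a "step" away) is an assumption chosen for illustration; the figure may group the qubits differently. The helper name `log_depth_layers` is hypothetical:

```python
def log_depth_layers(n: int):
    """Return a list of layers; each layer is a list of (control, target) pairs."""
    n_qubits = 2**n
    layers = []
    for i_layer in range(n):
        step = n_qubits // (2**i_layer)
        layers.append([(j, j + step // 2) for j in range(0, n_qubits, step)])
    return layers


layers = log_depth_layers(3)
for i_layer, layer in enumerate(layers):
    print(i_layer, layer)
print("total CNOTs:", sum(len(layer) for layer in layers))
```

For $N = 8$ this gives 3 layers with 1, 2, and 4 CNOTs, so $N - 1 = 7$ entangling gates in total. Within each layer no qubit appears twice, which is what allows the gates of a layer to be drawn in parallel.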

## Circuit vs. Execution Depth

Before going any further, it's worth distinguishing between the concept of circuit depth and circuit execution depth. Just because the circuit above looks shallower does not mean that a quantum computer would implement it physically like so. That depends on the possible connectivities, and also on whether instructions can be sent to the computer that will tell which gates can be done in parallel.

So let's try to address that. First, let's just revise the implementation of the circuit on Bloqade. We can achieve a log-depth GHZ preparation circuit like the above via the following workflow:

In [5]:
def ghz_log_depth(n: int):
    n_qubits = int(2**n)

    @qasm2.extended
    def layer_of_cx(i_layer: int, qreg: qasm2.QReg):
        # count layer and deploy CNOT gates accordingly
        step = n_qubits // (2**i_layer)
        for j in range(0, n_qubits, step):
            qasm2.cx(ctrl=qreg[j], qarg=qreg[j + step // 2])

    @qasm2.extended
    def ghz_log_depth_program():

        qreg = qasm2.qreg(n_qubits)
        # add starting Hadamard and build layers
        qasm2.h(qreg[0])
        for i in range(n):
            layer_of_cx(i_layer=i, qreg=qreg)

    return ghz_log_depth_program

Again, let's print the QASM2 script for a small instance of the circuit:

In [6]:
target = QASM2()
ast = target.emit(ghz_log_depth(2))
pprint(ast)

OPENQASM 2.0;
include "qelib1.inc";
qreg qreg[4];
h qreg[0];
CX qreg[0], qreg[2];
CX qreg[0], qreg[1];
CX qreg[2], qreg[3];


It should be clear that the number of gates is exactly the same as in the serial implementation above. The difference is that the last two CNOTs commute with each other and can be deployed in parallel. Still, each CNOT gate instruction inside our for-loop above is executed in sequence. Nothing in QASM can indicate to the quantum computer that some of these gates are parallelizable. Even though the circuit depth is $\log_2(N) = n$, the QASM execution depth is still $N$.
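The distinction can be quantified with a small plain-Python sketch. It is a toy depth counter, not a Bloqade feature, and the helper names are assumptions made for this example. A gate is represented only by the tuple of qubits it touches, and gates on disjoint qubits are allowed to share a time step:

```python
def depth(gates):
    """Circuit depth when gates acting on disjoint qubits may share a time step."""
    busy_until = {}  # qubit index -> first time step at which it is free
    total = 0
    for qubits in gates:
        start = max(busy_until.get(q, 0) for q in qubits)
        for q in qubits:
            busy_until[q] = start + 1
        total = max(total, start + 1)
    return total


def linear_gates(n_qubits):
    # Hadamard on qubit 0, then the cascading CNOT chain
    return [(0,)] + [(i - 1, i) for i in range(1, n_qubits)]


def log_gates(n):
    # Hadamard on qubit 0, then the log-depth CNOT layers
    n_qubits = 2**n
    gates = [(0,)]
    for i_layer in range(n):
        step = n_qubits // 2**i_layer
        gates += [(j, j + step // 2) for j in range(0, n_qubits, step)]
    return gates


n = 3
print("instructions executed one by one:", len(log_gates(n)))
print("depth of the linear circuit:     ", depth(linear_gates(2**n)))
print("depth of the log-depth circuit:  ", depth(log_gates(n)))
```

For $N = 8$ the log-depth gate list still contains $N = 8$ instructions if they are executed strictly one after another, while its depth is $n + 1 = 4$ once the Hadamard is counted. The linear chain has depth 8 either way.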

## Our Native Gate Set and Parallelism

Now we can fix the execution parallelization. By nature, neutral-atom quantum computers can execute certain native gates in parallel in a single instruction/execution cycle. The concept is very similar to SIMD (Single Instruction, Multiple Data) in classical computing.

On our hardware, there are two important factors to be considered:
1. The native gate set allows for arbitrary (parallel) single-qubit rotations and (parallel) 2-qubit CZ gates.
2. Our atom shuttling architecture allows arbitrary qubit connectivity. This means that our parallel instruction is not limited to fixed connectivity (for example nearest neighbor connectivity).

When we say "in parallel" we mean that the gates can be deployed simultaneously over many qubits, but also that the gates to be deployed have to be _exactly_ the same. In practice, how many gates can really be deployed in parallel, and how efficient atom transport or laser targeting is for that realization, depends on the given hardware capabilities. We can ignore that for the time being, for pedagogical purposes.

With the above in mind, we can rewrite the `layer_of_cx` subroutine above to now use the `qasm2.parallel` dialect in Bloqade. But we also have to choose gates within our native hardware-deployable set. A CNOT gate can be decomposed into a CZ gate flanked by single-qubit gates on the target qubit: $R_y(-\pi/2)$ before the CZ and $R_y(\pi/2)$ after it. $R_y(\theta)$ rotations can be represented natively for us in terms of $U3(\theta,\phi,\lambda)$ gates, for which we follow a standard convention such as [this](https://docs.quantum.ibm.com/api/qiskit/0.41/qiskit.circuit.library.U3Gate). We thus have $R_y(\theta) \equiv U3(\theta,0,0)$. The Hadamard can also be represented via such rotations, now as $H \equiv U3(\pi/2,0,\pi)$. This decomposition brings us to our native gate set.
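These identities are easy to verify numerically. The sketch below uses plain Python lists as matrices (no Bloqade or NumPy), with the control taken as the first tensor factor. The helper names `u3`, `ry`, `matmul`, `kron`, and `close` are defined here only for this check:

```python
import cmath
import math


def u3(theta, phi, lam):
    """U3 matrix in the standard convention referenced above."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [
        [c, -cmath.exp(1j * lam) * s],
        [cmath.exp(1j * phi) * s, cmath.exp(1j * (phi + lam)) * c],
    ]


def ry(theta):
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -s], [s, c]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def kron(a, b):
    return [[a[i][j] * b[k][l] for j in range(len(a[0])) for l in range(len(b[0]))]
            for i in range(len(a)) for k in range(len(b))]


def close(a, b, tol=1e-12):
    return all(abs(x - y) < tol for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


r = 1 / math.sqrt(2)
I2 = [[1, 0], [0, 1]]
H = [[r, r], [r, -r]]
CZ = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, -1]]
CNOT = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]

ry_is_u3 = close(u3(math.pi / 2, 0.0, 0.0), ry(math.pi / 2))
h_is_u3 = close(u3(math.pi / 2, 0.0, math.pi), H)

# Circuit order: Ry(-pi/2) on the target, then CZ, then Ry(pi/2) on the target.
# As a matrix product, the first gate applied is the rightmost factor.
decomposed = matmul(kron(I2, ry(math.pi / 2)), matmul(CZ, kron(I2, ry(-math.pi / 2))))
cnot_ok = close(decomposed, CNOT)

print(ry_is_u3, h_is_u3, cnot_ok)
```

All three checks hold exactly (up to floating-point rounding), with no leftover global phase.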

Bloqade deploys these natively parallelizable gates via the `parallel.u` and `parallel.cz` gate representations. So let's evolve the code above one more time:

In [7]:
def ghz_log_simd(n: int):
    n_qubits = int(2**n)

    @qasm2.extended
    def layer(i_layer: int, qreg: qasm2.QReg):
        step = n_qubits // (2**i_layer)

        def get_qubit(x: int):
            return qreg[x]

        ctrl_qubits = ilist.map(fn=get_qubit, collection=range(0, n_qubits, step))
        targ_qubits = ilist.map(
            fn=get_qubit, collection=ilist.range(step // 2, n_qubits, step)
        )

        # Ry(-pi/2)
        qasm2.parallel.u(qargs=targ_qubits, theta=-math.pi / 2, phi=0.0, lam=0.0)

        # CZ gates
        qasm2.parallel.cz(ctrls=ctrl_qubits, qargs=targ_qubits)

        # Ry(pi/2)
        qasm2.parallel.u(qargs=targ_qubits, theta=math.pi / 2, phi=0.0, lam=0.0)

    @qasm2.extended
    def ghz_log_depth_program():

        qreg = qasm2.qreg(n_qubits)

        qasm2.u3(qarg=qreg[0], theta=math.pi / 2, phi=0.0, lam=math.pi)
        for i in range(n):
            layer(i_layer=i, qreg=qreg)

    return ghz_log_depth_program

For simplicity, we used a non-parallelizable rewrite of the Hadamard, as we know we have a single one.

As usual, let's look at what QASM2 gives us:

In [8]:
target = QASM2()
ast = target.emit(ghz_log_simd(2))
pprint(ast)

OPENQASM 2.0;
include "qelib1.inc";
qreg qreg[4];
U(1.5707963267948966, 0.0, 3.141592653589793) qreg[0];
U(-1.5707963267948966, 0.0, 0.0) qreg[2];
cz qreg[0], qreg[2];
U(1.5707963267948966, 0.0, 0.0) qreg[2];
U(-1.5707963267948966, 0.0, 0.0) qreg[3];
U(-1.5707963267948966, 0.0, 0.0) qreg[1];
cz qreg[0], qreg[1];
cz qreg[2], qreg[3];
U(1.5707963267948966, 0.0, 0.0) qreg[3];
U(1.5707963267948966, 0.0, 0.0) qreg[1];


The above looks more complicated than before, naturally, but we are pursuing transpiling the circuit to a hardware-compatible gate set and that has constraints. The gain is that if we keep parallelism in the final circuit, we can reduce not only the logic steps to be logarithmic in the number of qubits $N$, but also the _execution_ depth of the circuit.

So it feels like we are going in the right direction. But nothing here indicates to the quantum computer which gates among the above can, or should, be operated in parallel.

Well, in truth, the code above covers that, but we just didn't make it explicit. To do so, we have to part ways with QASM2 representations. With Bloqade, we are able to extend QASM2 into a version that can only be parsed by neutral-atom devices, and which accepts instructions to indicate parallelization. The code above already built - by hand - instructions that were deployable in parallel via the declaration of `parallel.u` and `parallel.cz` with multiple targets. These instructions come up as delimiters `{}`, but to ensure that output instructions from Bloqade won't conflict with other QASM, we omit those by default. Showing where they appear can be achieved by simply setting the `allow_parallel` boolean at `QASM2()` to `True`:

In [9]:
target = QASM2(allow_parallel=True)
ast = target.emit(ghz_log_simd(2))
pprint(ast)

KIRIN {func,lowering.call,lowering.func,py.ilist,qasm2.core,qasm2.expr,qasm2.indexing,qasm2.parallel,qasm2.uop,scf};
include "qelib1.inc";
qreg qreg[4];
U(1.5707963267948966, 0.0, 3.141592653589793) qreg[0];
parallel.U(-1.5707963267948966, 0.0, 0.0) {
  qreg[2];
}
parallel.CZ {
  qreg[0], qreg[2];
}
parallel.U(1.5707963267948966, 0.0, 0.0) {
  qreg[2];
}
parallel.U(-1.5707963267948966, 0.0, 0.0) {
  qreg[1];
  qreg[3];
}
parallel.CZ {
  qreg[0], qreg[1];
  qreg[2], qreg[3];
}
parallel.U(1.5707963267948966, 0.0, 0.0) {
  qreg[1];
  qreg[3];
}


This looks exactly like what we were looking for, albeit missing some parallelization opportunities (notice that the $R_y(-\pi/2)$ gate on qubits 1 and 3 could have been commuted to the left of the CZ from 0 to 2, as well as to the left of the $R_y(\pi/2)$ on qubit 2). 

## Automatic compilation

We achieved what we wanted, but the process was maybe a bit too manual. Bloqade, in fact, has tools to automate the rewrite of circuits in neutral-atom-native gate sets, as well as a heuristic to improve the parallelization of a circuit. Let's learn to use these features using our GHZ state preparation example.

In order to do that, we should learn a bit about dialect groups and our compilation infrastructure. By now, we have learned how our functions can be decorated with the `@qasm2.extended` syntax, which allows us to compose functions, use control flow, and even define parallelizable instructions for a quantum computer. This extended QASM is considered a dialect of Bloqade, that is, a sub-language (eDSL). Now we will create our own dialect group to help automate the parallelization and compilation process. For that, we use tools from our Kirin compiler toolchain and create a new compiler! Without further ado, the code below creates what we need.

In [12]:
from bloqade.qasm2.rewrite.native_gates import RydbergGateSetRewriteRule
from kirin import ir
from kirin.rewrite import Walk
from bloqade.qasm2.passes import UOpToParallel, QASM2Fold


@ir.dialect_group(qasm2.extended)
def extended_opt(self):
    native_rewrite = Walk(RydbergGateSetRewriteRule(self)) # use Kirin's functionality to walk code line by line while applying neutral-atom gate decomposition as defined in Bloqade
    parallelize_pass = UOpToParallel(self) # review the code and apply parallelization using a heuristic
    agg_fold = QASM2Fold(self) # supports parallelization by unfolding loops to search for parallelization opportunities

    # here we define our new compiler pass
    def run_pass(
        kernel: ir.Method,
        *,
        fold: bool = True,
        typeinfer: bool = True,
        parallelize: bool = False,
    ):
        assert qasm2.extended.run_pass is not None
        qasm2.extended.run_pass(kernel, fold=fold, typeinfer=typeinfer) # apply the original run_pass to the lowered kernel
        native_rewrite.rewrite(kernel.code) # decompose all gates in the circuit to neutral atom gate set

        # here goes our parallelization optimizer; the order of the commands here matters!
        if parallelize:
            agg_fold.fixpoint(kernel)
            parallelize_pass(kernel)

    return run_pass

Now, the process above has nothing to do with quantum computing, but everything to do with compilers. In practice, we are creating a new decorator that we can use to interpret our kernels. This new decorator is itself created with `@ir.dialect_group(qasm2.extended)`, which defines a new dialect group based on `qasm2.extended`.

The comments in the code identify the main steps to transpile the gate set and then run an optimizer to seek opportunities for hardware-level parallelizations. 

Let's see this in action. We return to our original log-depth circuit and simply decorate our functions so they are interpreted by this new compiler pass:

In [13]:
def ghz_log_depth_2(n: int, parallelize: bool = True):
    n_qubits = int(2**n)

    @extended_opt
    def layer_of_cx(i_layer: int, qreg: qasm2.QReg):
        step = n_qubits // (2**i_layer)
        for j in range(0, n_qubits, step):
            qasm2.cx(ctrl=qreg[j], qarg=qreg[j + step // 2])


    @extended_opt(parallelize=parallelize)
    def ghz_log_depth_program():

        qreg = qasm2.qreg(n_qubits)

        qasm2.h(qreg[0])
        for i in range(n):
            layer_of_cx(i_layer=i, qreg=qreg)

    return ghz_log_depth_program

Note, no mentions of CZs here, and the Hadamard gate is just declared as we expect it to be from our "pen-and-paper" logic. Yet, when we look at the output QASM:

In [14]:
target = qasm2.emit.QASM2(
    allow_parallel=True,
)
ast = target.emit(ghz_log_depth_2(2, parallelize=True))
qasm2.parse.pprint(ast)

KIRIN {func,lowering.call,lowering.func,py.ilist,qasm2.core,qasm2.expr,qasm2.indexing,qasm2.parallel,qasm2.uop,scf};
include "qelib1.inc";
qreg qreg[4];
U(1.5707963267949, 0.0, 3.14159265358979) qreg[0];
parallel.U(1.5707963267949, 0.0, 6.28318530717959) {
  qreg[2];
  qreg[1];
  qreg[3];
}
cz qreg[0], qreg[2];
U(1.5707963267949, 3.14159265358979, 3.14159265358979) qreg[2];
U(0.0, 0.0, 3.14159265358979) qreg[0];
U(0.0, 0.0, 6.28318530717958) qreg[2];
cz qreg[0], qreg[1];
U(1.5707963267949, 3.14159265358979, 3.14159265358979) qreg[1];
U(0.0, 0.0, 3.14159265358979) qreg[0];
U(0.0, 0.0, 6.28318530717958) ...

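This is not too bad! $U3(\pi/2,\pi,\pi)$ is nothing but another way of writing $R_y(-\pi/2)$, and the $2\pi$ angles can really be just ignored; they result from numerics and are effectively the same as 0. $U3(\pi/2,0,\pi)$ is still our Hadamard. Note that the automatic rewrite uses a slightly different, but equivalent, CNOT decomposition than our manual one: the rotation before each CZ is $U3(\pi/2,0,2\pi) \equiv R_y(\pi/2)$, the one after it is $R_y(-\pi/2)$, and the $U3(0,0,\pi)$ gates are $Z$ gates on the control that compensate for the swapped signs.

We also achieve some degree of parallelization via our greedy optimizer. The three rotations that precede the CZs sit nicely together between our braces and the circuit is indeed valid. But we are missing out on several opportunities: the CZs are not coming together as they should, and this circuit also fails to align the final layer of single-qubit rotations that can be deployed together. So we still have some work to do!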
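As a sanity check on the emitted gates, the sketch below multiplies out the sequence printed for a single CNOT and compares it to the CNOT matrix. It is plain Python with hand-rolled matrix helpers (all helper names are assumptions for this example). The grouping of five gates per CNOT is read off the printed pattern, which is truncated, so treat it as an assumption. The control is the first tensor factor:

```python
import cmath
import math


def u3(theta, phi, lam):
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [
        [c, -cmath.exp(1j * lam) * s],
        [cmath.exp(1j * phi) * s, cmath.exp(1j * (phi + lam)) * c],
    ]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def kron(a, b):
    return [[a[i][j] * b[k][l] for j in range(len(a[0])) for l in range(len(b[0]))]
            for i in range(len(a)) for k in range(len(b))]


def close(a, b, tol=1e-9):
    return all(abs(x - y) < tol for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


PI = math.pi
I2 = [[1, 0], [0, 1]]
CZ = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, -1]]
CNOT = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]

# Gates in circuit (time) order for one CNOT, as they appear in the output.
steps = [
    kron(I2, u3(PI / 2, 0.0, 2 * PI)),  # U(pi/2, 0, 2pi) on the target
    CZ,
    kron(I2, u3(PI / 2, PI, PI)),       # U(pi/2, pi, pi) on the target
    kron(u3(0.0, 0.0, PI), I2),         # U(0, 0, pi) on the control
    kron(I2, u3(0.0, 0.0, 2 * PI)),     # U(0, 0, 2pi) on the target
]

total = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
for step in steps:
    total = matmul(step, total)  # later gates multiply from the left

c = s = 1 / math.sqrt(2)
is_ry_minus = close(u3(PI / 2, PI, PI), [[c, s], [-s, c]])  # Ry(-pi/2)
is_cnot = close(total, CNOT)
print("U3(pi/2, pi, pi) == Ry(-pi/2):", is_ry_minus)
print("emitted sequence == CNOT:     ", is_cnot)
```

Both comparisons succeed, confirming that the automatically generated gates implement the same CNOT as the manual decomposition.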

## Barriers for the win

To help our greedy optimizer ensure that it will generate circuits satisfying certain rules that we know are desirable for a given implementation, we can deploy barriers. In our case, we define barriers between groups of compatible CNOT gates, and then just deploy our new compiler pass as before:

In [15]:
def ghz_log_depth_3(n: int):
    n_qubits = int(2**n)

    @extended_opt
    def layer_of_cx(i_layer: int, qreg: qasm2.QReg):
        step = n_qubits // (2**i_layer)
        for j in range(0, n_qubits, step):
            qasm2.cx(ctrl=qreg[j], qarg=qreg[j + step // 2])
            qasm2.barrier((qreg[j], qreg[j + step // 2]))


    @extended_opt(parallelize=True)
    def ghz_log_depth_program():

        qreg = qasm2.qreg(n_qubits)

        qasm2.h(qreg[0])
        for i in range(n):
            layer_of_cx(i_layer=i, qreg=qreg)

    return ghz_log_depth_program

And checking the outcome,

In [16]:
target = qasm2.emit.QASM2(
    allow_parallel=True,
)
ast = target.emit(ghz_log_depth_3(2))
qasm2.parse.pprint(ast)

KIRIN {func,lowering.call,lowering.func,py.ilist,qasm2.core,qasm2.expr,qasm2.indexing,qasm2.parallel,qasm2.uop,scf};
include "qelib1.inc";
qreg qreg[4];
U(1.5707963267949, 0.0, 3.14159265358979) qreg[0];
parallel.U(1.5707963267949, 0.0, 6.28318530717959) {
  qreg[2];
  qreg[1];
  qreg[3];
}
cz qreg[0], qreg[2];
U(1.5707963267949, 3.14159265358979, 3.14159265358979) qreg[2];
U(0.0, 0.0, 3.14159265358979) qreg[0];
U(0.0, 0.0, 6.28318530717958) qreg[2];
barrier qreg[0], qreg[2];
parallel.CZ {
  qreg[0], qreg[1];
  qreg[2], qreg[3];
}
parallel.U(1.5707963267949, 3.1...

Now we've got it! The barriers go in the right places, operations are being combined in a way that maximizes parallelization, and the circuit is written in terms of native gates only! We achieved log-depth performance that can be deployed on neutral-atom quantum hardware!
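One plausible intuition for why barriers help a greedy scheduler can be seen in a toy model. The sketch below is not Bloqade's `UOpToParallel` heuristic; it is a hypothetical "as soon as possible" scheduler in plain Python, applied to the simple manual decomposition ($R_y(-\pi/2)$, CZ, $R_y(\pi/2)$). A barrier places no gate; it only forces its qubits to wait for each other:

```python
def asap_layers(ops):
    """Assign each gate the earliest layer in which all of its qubits are free."""
    free_at = {}  # qubit index -> first layer in which the qubit is free
    placed = []
    for name, qubits in ops:
        layer = max(free_at.get(q, 0) for q in qubits)
        if name == "barrier":
            for q in qubits:
                free_at[q] = layer  # synchronize, but occupy no layer
        else:
            placed.append((layer, name, qubits))
            for q in qubits:
                free_at[q] = layer + 1
    return placed


def cx_ops(ctrl, targ, with_barrier):
    # Manual CNOT decomposition: Ry(-pi/2), CZ, Ry(pi/2) on the target
    ops = [("ry-", (targ,)), ("cz", (ctrl, targ)), ("ry+", (targ,))]
    if with_barrier:
        ops.append(("barrier", (ctrl, targ)))
    return ops


def ghz4_ops(with_barrier):
    ops = [("h", (0,))]
    for ctrl, targ in [(0, 2), (0, 1), (2, 3)]:
        ops += cx_ops(ctrl, targ, with_barrier)
    return ops


def cz_layers(with_barrier):
    placed = asap_layers(ghz4_ops(with_barrier))
    return {qubits: layer for layer, name, qubits in placed if name == "cz"}


print("without barriers:", cz_layers(False))
print("with barriers:   ", cz_layers(True))
```

In this toy model, without barriers the CZ on qubits (0, 1) is scheduled one layer earlier than the CZ on (2, 3), because qubit 0 becomes free before qubit 2 finishes its trailing rotation. With the barrier after the first CNOT, qubit 0 waits for qubit 2, and both second-layer CZs land in the same layer, where identical gates can be merged into one parallel instruction.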

## Running the qubit simulator

Let's check how well this works. We use the `bloqade-pyqrack` package to run a qubit simulator alongside `bloqade`. 

(For installation instructions please read the [README](https://github.com/QuEraComputing/bloqade-pyqrack?tab=readme-ov-file#which-extra-do-i-install) of bloqade-pyqrack. As a note Mac OS doesn't support OpenCL but some systems do support CUDA.)

To run the simulation, we finish our QASM program by including measurement instructions.

In [17]:
def ghz_log_depth(n: int, parallelize: bool = True):
    n_qubits = int(2**n)

    @extended_opt
    def layer_of_cx(i_layer: int, qreg: qasm2.QReg):
        step = n_qubits // (2**i_layer)
        for j in range(0, n_qubits, step):
            qasm2.cx(ctrl=qreg[j], qarg=qreg[j + step // 2])
            qasm2.barrier((qreg[j], qreg[j + step // 2]))


    @extended_opt(parallelize=parallelize)
    def ghz_log_depth_program():

        qreg = qasm2.qreg(n_qubits)
        creg = qasm2.creg(n_qubits)

        qasm2.h(qreg[0])
        for i in range(n):
            layer_of_cx(i_layer=i, qreg=qreg)
            
        for i in range(n_qubits):
            qasm2.measure(qreg[i],creg[i])
            
        return creg # return register for simulation

    return ghz_log_depth_program


kernel = ghz_log_depth(2, parallelize=False)

And now we activate our simulation pipeline:

In [18]:
from bloqade.pyqrack import PyQrack
from collections import Counter

device = PyQrack(dynamic_qubits=True, pyqrack_options={"isBinaryDecisionTree": False})
results = device.multi_run(kernel, _shots=100)

def to_bitstrings(results):
    return Counter(map(lambda result:"".join(map(str, result)), results))

counts = to_bitstrings(results)

for key, value in counts.items():
    print(key, value)

 No platforms found. Check OpenCL installation!
1111 47
0000 53


In this case we are just sampling the circuit, and the results look as expected for a GHZ state: only the bitstrings `0000` and `1111` appear, with roughly equal frequency.

## Adding Noise to the simulation

To wrap up, Bloqade accepts the definition of a heuristic noise model via `NoisePass`. This allows you to inject noise based on the gate type via specific noise parameters. You can even inject extra noise due to atom moves (for example, those accompanying CZ gates on current hardware). By default, `NoisePass` implements a basic model based on some very simple heuristics and a two-row zone layout for the storage and gate zones.

In [19]:
from bloqade.qasm2.passes import NoisePass
from bloqade.noise import native

# add noise
noise_kernel = kernel.similar()
extended_opt.run_pass(noise_kernel, parallelize=True)
NoisePass(extended_opt)(noise_kernel)

noise_kernel = noise_kernel.similar(extended_opt.add(native))

And now running the simulation:

In [21]:
device = PyQrack(dynamic_qubits=True, pyqrack_options={"isBinaryDecisionTree": False})
results = device.multi_run(noise_kernel, _shots=1000)


counts = to_bitstrings(results)

for key, value in counts.items():
    print(key, value)

0000 479
1111 476
1110 4
1011 5
1100 9
1000 1
0001 6
0011 8
0100 5
0010 4
0111 1
1101 2


As expected, we start leaking out of the GHZ subspace. 
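The amount of leakage can be read directly off the counts. The short sketch below copies the noisy counts printed above into a dictionary and computes the fraction of shots that landed in `0000` or `1111`. Note that this population is only a rough indicator: it does not measure the coherence between the two bitstrings, so it is not the GHZ state fidelity.

```python
# Counts copied from the noisy simulation output above
counts = {
    "0000": 479, "1111": 476, "1110": 4, "1011": 5, "1100": 9, "1000": 1,
    "0001": 6, "0011": 8, "0100": 5, "0010": 4, "0111": 1, "1101": 2,
}

shots = sum(counts.values())
ghz_population = (counts.get("0000", 0) + counts.get("1111", 0)) / shots
leakage = 1 - ghz_population

print("shots:", shots)
print("population in {0000, 1111}:", ghz_population)
print("leakage:", round(leakage, 3))
```

For this run, 955 of the 1000 shots remain in the two GHZ bitstrings, so about 4.5% of the shots leaked into other bitstrings.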

This concludes our tutorial!