# Advanced - How to Define Custom Gates
In an earlier example notebook, we showed you the basics of defining custom gates. The approach demonstrated there may already be enough for your purposes, and potentially already as efficient as possible. In this notebook, we will discuss the considerations that go into making high-performing gates. 

In [1]:
using PauliPropagation

### The SWAP example

Let us again consider the SWAP gate example.

In [2]:
struct CustomSWAPGate <: StaticGate
    qinds::Tuple{Int, Int}  # The two sites to be swapped
end

Again, define its action:

In [3]:
function PauliPropagation.apply(gate::CustomSWAPGate, pstr, coeff; kwargs...)
    # get the Pauli on the first site
    pauli1 = getpauli(pstr, gate.qinds[1])
    # get the Pauli on the second site
    pauli2 = getpauli(pstr, gate.qinds[2])
    
    # set the Pauli on the first site to the second Pauli
    pstr = setpauli(pstr, pauli2, gate.qinds[1])
    # set the Pauli on the second site to the first Pauli
    pstr = setpauli(pstr, pauli1, gate.qinds[2])

    # apply() is always expected to return a tuple of (pstr, coeff) tuples
    return tuple((pstr, coeff))
end
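To make the get/set logic above more tangible, here is a minimal sketch in plain Julia that swaps two Paulis in an integer-encoded Pauli string. It assumes a toy encoding with 2 bits per site (`I = 0`, `X = 1`, `Y = 2`, `Z = 3`, site 1 in the lowest bits). The helpers `toy_getpauli`, `toy_setpauli` and `toy_swap` are hypothetical names for illustration and are not the library's functions or necessarily its exact bit layout.

```julia
# Toy encoding (an assumption for illustration): 2 bits per site,
# with I = 0, X = 1, Y = 2, Z = 3, and site 1 stored in the lowest bits.
toy_getpauli(pstr::UInt64, site::Int) = (pstr >> (2 * (site - 1))) & UInt64(3)

function toy_setpauli(pstr::UInt64, pauli::Integer, site::Int)
    shift = 2 * (site - 1)
    cleared = pstr & ~(UInt64(3) << shift)     # zero out the two bits of `site`
    return cleared | (UInt64(pauli) << shift)  # write the new Pauli into them
end

function toy_swap(pstr::UInt64, site1::Int, site2::Int)
    pauli1 = toy_getpauli(pstr, site1)
    pauli2 = toy_getpauli(pstr, site2)
    pstr = toy_setpauli(pstr, pauli2, site1)
    return toy_setpauli(pstr, pauli1, site2)
end

xz = toy_setpauli(toy_setpauli(UInt64(0), 1, 1), 3, 2)  # X on site 1, Z on site 2
zx = toy_swap(xz, 1, 2)                                  # Z on site 1, X on site 2
```

Every step is a handful of bit operations on a `UInt64` with no allocation, which is consistent with the custom gate being cheap to apply.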

Now set up a bigger simulation with 25 qubits on a 5 by 5 grid.

In [4]:
nx = 5
ny = 5
nq = nx * ny

topology = rectangletopology(nx, ny);

We use `nl` layers of a circuit consisting of `RX` and `RZZ` Pauli rotations.

In [5]:
nl = 3
base_circuit = tfitrottercircuit(nq, nl; topology=topology);
nparams = countparameters(base_circuit)

195

Define our observable as $ Z_7 Z_{13} $.

In [6]:
pstr = PauliString(nq, [:Z, :Z], [7, 13])

PauliString(nqubits: 25, 1.0 * IIIIIIZIIIIIZIIIIIII...)

Draw the circuit parameters at random with a fixed seed.

In [7]:
using Random
Random.seed!(42)
thetas = randn(nparams);

For this notebook, we will use a minimum coefficient threshold. The results are still almost exact in this simple case.

In [8]:
min_abs_coeff = 5e-3

0.005

Now add a 1D line of SWAP gates after the first and second layer of gates in the base circuit.

In [9]:
nparams_per_layer = Int(length(base_circuit)/nl)

65

In [10]:
ourSWAP_circuit = deepcopy(base_circuit);
# first the second layer so the insertion indices don't change 
for qind in 1:(nq-1)
    insert!(ourSWAP_circuit, 2*nparams_per_layer, CustomSWAPGate((qind, qind+1)))
end
for qind in 1:(nq-1)
    insert!(ourSWAP_circuit, nparams_per_layer, CustomSWAPGate((qind, qind+1)))
end
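The insertion pattern above relies on how `insert!` behaves on a vector: `insert!(v, i, x)` places `x` at index `i` and shifts the element previously at `i`, and everything after it, one slot to the right. The following toy sketch uses symbols as stand-ins for gates (the names are placeholders, not library objects) to show the effect of inserting at the later position first and of inserting repeatedly at the same index.

```julia
circuit = [:a1, :a2, :b1, :b2]   # two toy "layers" of two gates each
layer_len = 2

# Insert at the later position first, so the earlier index stays valid.
for s in (:s1, :s2)
    insert!(circuit, 2 * layer_len, s)
end
for t in (:t1, :t2)
    insert!(circuit, layer_len, t)
end

println(circuit)
```

In this toy, the result is `[:a1, :t2, :t1, :a2, :b1, :s2, :s1, :b2]`: the new entries end up in reverse insertion order, directly in front of the element that used to sit at the insertion index. The same index arithmetic is used for both circuits compared in this notebook, so the comparison between them is unaffected.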

Run the circuit

In [11]:
@time ourSWAP_psum = propagate(ourSWAP_circuit, pstr, thetas; min_abs_coeff=min_abs_coeff)

  0.375881 seconds (785.42 k allocations: 37.534 MiB, 99.65% compilation time)


PauliSum(nqubits: 25, 576 Pauli terms:
 0.0053683 * IIIIYZIIIXYIIIZIIIII...
 -0.0069415 * ZIIZYXZIIZXXZIIIZIII...
 0.080065 * IIIZYIIIIIYIIIIIIIII...
 -0.0085018 * IIIIIZIIIXYIIIIIIIII...
 -0.03761 * IIIZXZIIIZZIIIIIIIII...
 -0.0063894 * IIIZXIIIIZYYZIIIZIII...
 0.022245 * IIIZYIIIIZYZIIIIIIII...
 0.0070561 * IIIZXXIIIXXXZIZIZIII...
 0.0071172 * IIIZYZIIIZYYZIIZZIII...
 -0.00951 * IIIIIZZIIYXXZIZZYIII...
 0.007368 * IIIZXIIIIXYIIIZIIIII...
 0.093232 * IIIZXIIIIXXIIIZZIIII...
 0.016597 * IIIIZZIIIXYIIIZIIIII...
 -0.0058598 * ZIIIIXIIIYXZIIZIIIII...
 0.0086098 * IIIIZIIIIXXIIIZZIIII...
 0.046612 * IIIZXIIIIZXIIIIZIIII...
 0.012269 * IIIZXZIIIZXIIIIIIIII...
 0.056603 * IIIIIZZIIYXXZIZIZIII...
 -0.029879 * IIIIZXZIIIZZIIIIIIII...
 0.015797 * ZIIIIYIIIYXXZIZIZIII...
  ⋮)

Overlap with the zero-state:

In [12]:
overlapwithzero(ourSWAP_psum)

-0.7211301948203009

We already mentioned that `PauliPropagation.jl` contains a `CliffordGate` implementation of SWAP. Let's implement the same thing and compare performance.

In [13]:
cliffSWAP_circuit = deepcopy(base_circuit);
for qind in 1:(nq-1)
    insert!(cliffSWAP_circuit, 2*nparams_per_layer, CliffordGate(:SWAP, (qind, qind+1)))
end
for qind in 1:(nq-1)
    insert!(cliffSWAP_circuit, nparams_per_layer, CliffordGate(:SWAP, (qind, qind+1)))
end

In [14]:
@time cliffSWAP_psum = propagate(cliffSWAP_circuit, pstr, thetas; min_abs_coeff=min_abs_coeff);

  0.042016 seconds (82.73 k allocations: 4.122 MiB, 97.09% compilation time)


Are the results the same?

In [15]:
overlapwithzero(cliffSWAP_psum)

-0.7211301948203009

In [16]:
cliffSWAP_psum == ourSWAP_psum

true

Yes!

We can also benchmark the performance.

In [17]:
using BenchmarkTools

In [18]:
@btime propagate($ourSWAP_circuit, $pstr, $thetas; min_abs_coeff=$min_abs_coeff);

  1.103 ms (1030 allocations: 158.00 KiB)


In [19]:
@btime propagate($cliffSWAP_circuit, $pstr, $thetas; min_abs_coeff=$min_abs_coeff);

  1.145 ms (1030 allocations: 158.00 KiB)


There is no downside at all to defining our custom gate. Why? Because the `apply` function for this gate is *type stable*! Type stability is absolutely crucial for performance in Julia, and codes live and die by it.

In [20]:
@code_warntype apply(CustomSWAPGate((7, 8)), pstr.term, 0.0, 1.0)

MethodInstance for apply(::CustomSWAPGate, ::UInt64, ::Float64, ::Float64)
  from apply(gate::SG, pstr, coeff, theta; kwargs...) where SG<:StaticGate @ PauliPropagation ~/.julia/dev/PauliPropagation/src/Propagation/generics.jl:172
Static Parameters
  SG = CustomSWAPGate
Arguments
  #self#::Core.Const(PauliPropagation.apply)
  gate::CustomSWAPGate
  pstr::UInt64
  coeff::Float64
  theta::Float64
Body::Tuple{Tuple{UInt64, Float64}}
1 ─ %1 = PauliPropagation.:(var"#apply#118")::Core.Const(PauliPropagation.var"#apply#118")
│   %2 = Core.NamedTuple()::Core.Const(NamedTuple())
│   %3 = Base.pairs(%2)::Core.Const(Base.Pairs{Symbol, Union{}, Tuple{}, @NamedTuple{}}())
│   %4 = (%1)(%3, #self#, gate, pstr, coeff, theta)::Tuple{Tuple{UInt64, Float64}}


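When you run this yourself, all types printed in blue means that everything is great: every inferred type is concrete. If correctly implemented, `apply` is type stable when it always returns the same, known number of Pauli string and coefficient pairs. Here that number is 1 because SWAP is a Clifford gate.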
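The link between "a known number of returned pairs" and type stability can also be checked programmatically. The sketch below is plain Julia and does not use the library: `one_pair` and `one_or_two_pairs` are hypothetical toy functions, and `Base.return_types` is used to ask the compiler for the inferred return type.

```julia
# Always one pair: the return type is a single concrete tuple type.
one_pair(pstr::UInt64, coeff::Float64) = ((pstr, coeff),)

# One or two pairs depending on a runtime value: the return type is a Union.
function one_or_two_pairs(pstr::UInt64, coeff::Float64)
    if iseven(pstr)
        return ((pstr, coeff),)
    end
    return ((pstr, coeff / 2), (pstr + 1, coeff / 2))
end

stable = only(Base.return_types(one_pair, (UInt64, Float64)))
unstable = only(Base.return_types(one_or_two_pairs, (UInt64, Float64)))

println(stable, " is concrete: ", isconcretetype(stable))
println(unstable, " is concrete: ", isconcretetype(unstable))
```

The second function is the situation to watch out for: the length of the returned tuple is part of its type, so returning one pair on some branches and two on others cannot be described by one concrete type.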

### A gate that branches into more than one Pauli string

Onto an example of a gate that can _split_ a Pauli string into two: The `T` gate.

In [21]:
struct CustomTGate <: StaticGate
    qind::Int
end

A `T` gate is a non-Clifford gate that commutes with `I` and `Z`, splits `X` into `cos(π/4)X - sin(π/4)Y`, and `Y` into `cos(π/4)Y + sin(π/4)X` (in the Heisenberg picture). 

Let's write the code for that.

In [22]:
function PauliPropagation.apply(gate::CustomTGate, pstr, coeff; kwargs...)
    # get the Pauli on the site `gate.qind`
    pauli = getpauli(pstr, gate.qind)
    
    if pauli == 0 || pauli == 3  # I or Z commute
        # return a tuple of one (pstr, coeff) tuple
        return tuple((pstr, coeff))     
    end
    
    if pauli == 1 # X goes to X, -Y
        new_pauli = 2  # Y
        # set the Pauli
        new_pstr = setpauli(pstr, new_pauli, gate.qind)
        # adapt the coefficients
        new_coeff = -1 * coeff * sin(π/4)
        
    else # Y goes to Y, X
        new_pauli = 1  # X
        # set the Pauli
        new_pstr = setpauli(pstr, new_pauli, gate.qind)
        # adapt the coefficients
        new_coeff = coeff * sin(π/4)
    end

    updated_coeff = coeff * cos(π/4)

    # return a tuple of two (pstr, coeff) tuples
    return tuple((pstr, updated_coeff), (new_pstr, new_coeff))
    
end
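The coefficients used above can be sanity-checked against 2×2 matrices. This sketch is independent of the library and assumes the standard matrix `T = diag(1, exp(iπ/4))`; in the Heisenberg picture a Pauli `P` is mapped to `T' * P * T`. The helper name `heisenberg` is just for illustration.

```julia
using LinearAlgebra

X = ComplexF64[0 1; 1 0]
Y = ComplexF64[0 -im; im 0]
Z = ComplexF64[1 0; 0 -1]
T = ComplexF64[1 0; 0 exp(im * π / 4)]

# Heisenberg-picture action of the T gate on an operator P.
heisenberg(P) = T' * P * T

c, s = cos(π / 4), sin(π / 4)

println(heisenberg(Z) ≈ Z)              # Z commutes with T
println(heisenberg(X) ≈ c * X - s * Y)  # X -> cos(π/4) X - sin(π/4) Y
println(heisenberg(Y) ≈ c * Y + s * X)  # Y -> cos(π/4) Y + sin(π/4) X
```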

Insert a layer of `CustomTGate`s after the first and second layers of the base circuit.

In [23]:
ourT_circuit = deepcopy(base_circuit);
for qind in 1:nq
    insert!(ourT_circuit, 2*nparams_per_layer, CustomTGate(qind))
end
for qind in 1:nq
    insert!(ourT_circuit, nparams_per_layer, CustomTGate(qind))
end

And run:

In [24]:
@time ourT_psum = propagate(ourT_circuit, pstr, thetas; min_abs_coeff=min_abs_coeff)

  0.040504 seconds (72.30 k allocations: 3.653 MiB, 89.10% compilation time)


PauliSum(nqubits: 25, 1702 Pauli terms:
 0.014863 * IIIIIIZZIIIIXZIIIXZI...
 -0.032827 * IZIIIZYZIIIZYZIIIZII...
 0.0058029 * IZZIIIIYIIIZXZIIIZII...
 0.0054978 * IZIIIYXZIIIXXIIIZZII...
 -0.014213 * IIIIIIXIZIIZYYIIIYII...
 -0.0073527 * IIIIIIIIZIIYXYIIZXII...
 0.0075631 * IIIIIIIIIIIXXZIIIYZI...
 -0.0090057 * IZIIIIZYIIIZXIIIIZII...
 -0.053549 * IIIIIZXZIIIZXZIIIZII...
 0.01401 * IZIIIYXYIIIZYZIIIZII...
 0.0097479 * IIIIIIXIZIIIXYIIIZZI...
 0.019155 * IIIIIIXIIIIZXZIIIXZI...
 0.011572 * IIIIIIYZZIIXXXZIZZZI...
 0.0090639 * IIIIIZXIIIIYZIIIZYII...
 -0.0069645 * IZIIIXIZIIIZYIIIIZII...
 0.0061054 * IIIIIXIIIIIIZIIIIIII...
 0.031437 * IZIIIZYIIIIIXZIIIZII...
 -0.0056792 * IIIIIIZZZIIXYYIIZYII...
 0.005949 * IZIIIIYZIIIXXZIIZYZI...
 -0.0075995 * IIIIIIYZIIIIYIIIIYZI...
  ⋮)

In [25]:
overlapwithzero(ourT_psum)

0.33262899358840403

But did it work? Again, we have an implementation of a `TGate` in our library. In case you are interested, we currently implement `T` gates as Pauli `Z` rotations at an angle of `π/4`. Let's compare to that.

In [26]:
libraryT_circuit = deepcopy(base_circuit);
for qind in 1:nq
    insert!(libraryT_circuit, 2*nparams_per_layer, TGate(qind))
end
for qind in 1:nq
    insert!(libraryT_circuit, nparams_per_layer, TGate(qind))
end

If you call `PauliGate(:Z, qind, parameter)`, this will create a so-called `FrozenGate` wrapping the parametrized `PauliGate`, with a fixed `parameter` at the time of circuit construction.

Run it and compare.

In [27]:
@time libraryT_psum = propagate(libraryT_circuit, pstr, thetas; min_abs_coeff=min_abs_coeff);

  0.009682 seconds (9.65 k allocations: 766.672 KiB, 54.55% compilation time)


In [28]:
overlapwithzero(libraryT_psum)

0.33262899358840403

In [29]:
libraryT_psum == ourT_psum

true

It works! But is it optimal?

In [30]:
using BenchmarkTools

In [31]:
@btime propagate($ourT_circuit, $pstr, $thetas; min_abs_coeff=$min_abs_coeff);

  4.128 ms (1045 allocations: 273.09 KiB)


In [32]:
@btime propagate($libraryT_circuit, $pstr, $thetas; min_abs_coeff=$min_abs_coeff);

  4.182 ms (1250 allocations: 326.22 KiB)


No, because `apply` for the `CustomTGate` is not type-stable.

In [33]:
@code_warntype apply(CustomTGate(7), pstr.term, 0.0)

MethodInstance for apply(::CustomTGate, ::UInt64, ::Float64)
  from apply(gate::CustomTGate, pstr, coeff; kwargs...) @ Main In[22]:1
Arguments
  #self#::Core.Const(PauliPropagation.apply)
  gate::CustomTGate
  pstr::UInt64
  coeff::Float64
Body::Union{Tuple{Tuple{UInt64, Float64}}, Tuple{Tuple{UInt64, Float64}, Tuple{UInt64, Float64}}}
1 ─ %1 = Main.:(var"#apply#2")::Core.Const(Main.var"#apply#2")
│   %2 = Core.NamedTuple()::Core.Const(NamedTuple())
│   %3 = Base.pairs(%2)::Core.Const(Base.Pairs{Symbol, Union{}, Tuple{}, @NamedTuple{}}())
│   %4 = (%1)(%3, #self#, gate, pstr, coeff)::Union{Tuple{Tuple{UInt64, Float64}}, Tuple{Tuple{UInt64, Float64}, Tuple{UInt64, Float64}}}
└──      return %4



------------------------------------------------------------

#### NOTE:
Due to ongoing changes in the code base and unclear compiler optimizations, this example works "better than expected". Please bear with us and feel free to pretend that this function is slower than our implementation. In earlier versions of the code it was, and for other gates it may still be.

------------------------------------------------------------

It either returns a tuple of one tuple `Tuple{Tuple{UInt64, Float64}}` or a tuple of two tuples `Tuple{Tuple{UInt64, Float64}, Tuple{UInt64, Float64}}`. `@code_warntype` highlights such a `Union` return type in yellow, which means it might be okay (it is not that much slower after all), but be wary of red. When you run into this, you may want to overload a function one level above `apply()` for optimal performance. This is how we would do it.

To avoid such type instabilities, we can overload the slightly higher-level function `applyandadd!()`, which does the job of `apply()` but, as the name hints, also adds the created Pauli strings to the propagating Pauli sum. We can practically copy-paste the code from `apply()`; the only difference is that we don't return anything and instead `add!()` the Pauli strings to `output_psum`. Be mindful that the function signature needs to be exactly `applyandadd!(gate, pstr, coeff, theta, output_psum; kwargs...)`. Even though you might not need the parameter `theta`, your function must accept it.

In [34]:
function PauliPropagation.applyandadd!(gate::CustomTGate, pstr, coeff, theta, output_psum; kwargs...)
    
    pauli = getpauli(pstr, gate.qind)
    
    if pauli == 0 || pauli == 3  # I or Z commute
        add!(output_psum, pstr, coeff)
        return
    end

    if pauli == 1 # X goes to X, -Y
        new_pauli = 2  # Y
        new_pstr = setpauli(pstr, new_pauli, gate.qind)
        new_coeff = -1 * coeff * sin(π/4)
    else # Y goes to Y, X
        new_pauli = 1  # X
        new_pstr = setpauli(pstr, new_pauli, gate.qind)
        new_coeff = coeff * sin(π/4)
    end

    updated_coeff = coeff * cos(π/4)
    
    add!(output_psum, pstr, updated_coeff)
    add!(output_psum, new_pstr, new_coeff)

    return
end
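To see why this pattern removes the `Union` from the return type, here is a toy sketch in plain Julia. A `Dict` stands in for the Pauli sum, and `toy_add!` and `toy_applyandadd!` are hypothetical helpers, not the library's `add!` or `applyandadd!`. The function still produces one or two terms depending on its input, but since the terms are written into the accumulator, every branch returns `nothing`.

```julia
# Toy stand-in for a Pauli sum: integer-encoded strings mapped to coefficients.
function toy_add!(acc::Dict{UInt64,Float64}, pstr::UInt64, coeff::Float64)
    acc[pstr] = get(acc, pstr, 0.0) + coeff
    return nothing
end

# Produces one or two terms depending on `pstr`, but returns nothing either way.
function toy_applyandadd!(acc::Dict{UInt64,Float64}, pstr::UInt64, coeff::Float64)
    if iseven(pstr)
        toy_add!(acc, pstr, coeff)
        return nothing
    end
    toy_add!(acc, pstr, coeff / 2)
    toy_add!(acc, pstr + 1, coeff / 2)
    return nothing
end

acc = Dict{UInt64,Float64}()
toy_applyandadd!(acc, UInt64(2), 1.0)   # adds one term
toy_applyandadd!(acc, UInt64(3), 1.0)   # adds two terms

ret = only(Base.return_types(toy_applyandadd!, (Dict{UInt64,Float64}, UInt64, Float64)))
println(ret)
```

The number of produced terms is now a runtime property of the accumulator rather than part of the function's return type.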

This should resolve the slight type instability. Let's see if it worked and gives the same results.

In [35]:
@time ourT_psum2 = propagate(ourT_circuit, pstr, thetas; min_abs_coeff=min_abs_coeff);

  0.038371 seconds (39.50 k allocations: 2.093 MiB, 88.99% compilation time: 100% of which was recompilation)


In [36]:
overlapwithzero(ourT_psum2)

0.33262899358840403

In [37]:
ourT_psum == ourT_psum2

true

And check the performance.

In [38]:
@btime propagate($ourT_circuit, $pstr, $thetas; min_abs_coeff=$min_abs_coeff);

  6.489 ms (1045 allocations: 273.09 KiB)


In [39]:
@btime propagate($libraryT_circuit, $pstr, $thetas; min_abs_coeff=$min_abs_coeff);

  6.511 ms (1250 allocations: 326.22 KiB)


This is already quite fast, but it can still be a bit slower than our built-in `TGate`. How so? The answer lies in the fact that we move Pauli strings more than necessary. Because the runtime of the T gate simulation is dominated by the commuting cases (`I` is very common for local observables), we could leave those commuting Pauli strings where they are: in their original Pauli sum. For this, we can overload the function `applytoall!()`. It differs in that we perform the loop over the Pauli strings in the Pauli sum ourselves, and can thus use the old Pauli sum more flexibly. Our convention is that anything left in `psum` or `aux_psum` is later merged back into `psum`. Thus, we can simply skip the commuting Pauli strings and edit the coefficients of the remaining Pauli strings in place. See this version of the function:

In [40]:
function PauliPropagation.applytoall!(gate::CustomTGate, theta, psum, aux_psum; kwargs...)
    
    for (pstr, coeff) in psum 
    
        pauli = getpauli(pstr, gate.qind)

        if pauli == 0 || pauli == 3  # I or Z commute
            # do nothing
            continue
        end

        if pauli == 1 # X goes to X, -Y
            new_pauli = 2  # Y
            new_pstr = setpauli(pstr, new_pauli, gate.qind)
            new_coeff = -1 * coeff * sin(π/4)
        else # Y goes to Y, X
            new_pauli = 1  # X
            new_pstr = setpauli(pstr, new_pauli, gate.qind)
            new_coeff = coeff * sin(π/4)
        end

        updated_coeff = coeff * cos(π/4)

        set!(psum, pstr, updated_coeff)
        set!(aux_psum, new_pstr, new_coeff)
    end
    return
end

In [41]:
@time ourT_psum2 = propagate(ourT_circuit, pstr, thetas; min_abs_coeff=min_abs_coeff);

  0.065850 seconds (29.42 k allocations: 1.681 MiB, 89.81% compilation time: 100% of which was recompilation)


In [42]:
overlapwithzero(ourT_psum2)

0.33262899358840403

In [43]:
ourT_psum == ourT_psum2

true

And check the performance.

In [44]:
@btime propagate($ourT_circuit, $pstr, $thetas; min_abs_coeff=$min_abs_coeff);

  6.558 ms (1050 allocations: 319.97 KiB)


In [45]:
@btime propagate($libraryT_circuit, $pstr, $thetas; min_abs_coeff=$min_abs_coeff);

  6.523 ms (1250 allocations: 326.22 KiB)


Enjoy defining custom and high-performance gates! 