# Advanced - How to Define Custom Gates
In an earlier example notebook, we showed you the basics of defining custom gates. The approach demonstrated there may already be enough for your purposes, and potentially already as efficient as possible. In this notebook, we will discuss the considerations that go into making high-performing gates. 

In [1]:
using Revise
using PauliPropagation

### The SWAP example

Let us again consider the SWAP gate example.

In [2]:
struct CustomSWAPGate <: StaticGate
    qinds::Tuple{Int, Int}  # The two sites to be swapped
end

Again define the action,

In [3]:
function PauliPropagation.apply(gate::CustomSWAPGate, pstr, coeff; kwargs...)
    # get the Pauli on the first site
    pauli1 = getpauli(pstr, gate.qinds[1])
    # get the Pauli on the second site
    pauli2 = getpauli(pstr, gate.qinds[2])
    
    # set the Pauli on the first site to the second Pauli
    pstr = setpauli(pstr, pauli2, gate.qinds[1])
    # set the Pauli on the second site to the first Pauli
    pstr = setpauli(pstr, pauli1, gate.qinds[2])

    # apply() is always expected to return a tuple of (pstr, coeff) tuples
    return tuple((pstr, coeff))
end
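To make the get/set steps above concrete, here is a minimal, self-contained sketch of what reading and writing a Pauli could look like at the bit level. It assumes that each Pauli occupies two bits of an unsigned integer with the encoding `I = 0, X = 1, Y = 2, Z = 3`; the helpers `toy_getpauli`, `toy_setpauli` and `toy_swap` are hypothetical stand-ins for illustration, not the library's actual implementation.

```julia
# Toy model: qubit `i` (1-based) lives in bits 2i-2 and 2i-1 of an unsigned integer.
# Assumed encoding: I = 0, X = 1, Y = 2, Z = 3.

# Read the 2-bit Pauli at site `i`.
toy_getpauli(pstr::Unsigned, i::Integer) = (pstr >> (2 * (i - 1))) & 0x3

# Return a copy of `pstr` with the Pauli at site `i` replaced by `pauli`.
function toy_setpauli(pstr::T, pauli::Integer, i::Integer) where {T<:Unsigned}
    shift = 2 * (i - 1)
    cleared = pstr & ~(T(0x3) << shift)      # zero out the two bits of site i
    return cleared | (T(pauli) << shift)     # write the new Pauli into those bits
end

# Swap the Paulis on sites `i` and `j`, mirroring the structure of apply() above.
function toy_swap(pstr::Unsigned, i::Integer, j::Integer)
    pauli_i = toy_getpauli(pstr, i)
    pauli_j = toy_getpauli(pstr, j)
    pstr = toy_setpauli(pstr, pauli_j, i)
    return toy_setpauli(pstr, pauli_i, j)
end

toy_pstr = toy_setpauli(toy_setpauli(UInt8(0), 1, 1), 3, 2)  # X on site 1, Z on site 2
toy_swapped = toy_swap(toy_pstr, 1, 2)                        # Z on site 1, X on site 2
```

Because every helper returns a value of the same unsigned integer type it received, a gate built this way always produces exactly one output string of a known type.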

Now set up a bigger simulation with 25 qubits on a 5 by 5 grid.

In [4]:
nx = 5
ny = 5
nq = nx * ny

topology = rectangletopology(nx, ny);

We build `nl` layers of a Trotter circuit consisting of `RX` and `RZZ` Pauli rotations, and insert a layer of SWAP gates between consecutive layers.

In [5]:
trotter_layer = tfitrottercircuit(nq, 1; topology=topology);

nl = 3
ourSWAP_circuit = Gate[]
append!(ourSWAP_circuit, trotter_layer)
for _ in 2:nl
    append!(ourSWAP_circuit, (CustomSWAPGate(pair) for pair in topology))
    append!(ourSWAP_circuit, trotter_layer)
end

In [6]:
nparams = countparameters(ourSWAP_circuit)

195
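The count of 195 can be reproduced by hand under two assumptions that the number itself suggests: each Trotter layer places one `RX` rotation on every qubit and one `RZZ` rotation on every nearest-neighbour edge of the grid, and the SWAP gates carry no parameters. The variable names below are illustrative only.

```julia
# Hand count of the circuit parameters for a 5x5 grid and 3 Trotter layers.
grid_nx, grid_ny, n_layers = 5, 5, 3

n_qubits = grid_nx * grid_ny                                  # assumed: one RX per qubit
n_edges = grid_nx * (grid_ny - 1) + grid_ny * (grid_nx - 1)   # assumed: one RZZ per grid edge

params_per_layer = n_qubits + n_edges
total_params = n_layers * params_per_layer
```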

Define our observable as $ Z_7 Z_{13} $.

In [7]:
pstr = PauliString(nq, [:Z, :Z], [7, 13])

PauliString(nqubits: 25, 1.0 * IIIIIIZIIIIIZIIIIIII...)

Draw the circuit parameters at random with a fixed seed.

In [8]:
using Random
Random.seed!(42)
thetas = randn(nparams);

For this notebook, we will use a minimum coefficient threshold. The results are still almost exact in this simple case.

In [9]:
min_abs_coeff = 1e-3

0.001
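Conceptually, a minimum coefficient threshold discards every Pauli string whose coefficient magnitude falls below the cutoff. The following toy sketch illustrates the idea on a plain `Dict`; the function `toy_truncate!` and the string keys are hypothetical, and the sketch says nothing about how or when the library applies its truncation internally.

```julia
# Toy Pauli sum: Pauli strings (as plain strings) mapped to coefficients.
toy_psum = Dict("ZZI" => 0.8, "XYI" => -0.02, "IYZ" => 4e-4, "XXX" => -9e-4)

# Drop all terms with |coefficient| below `cutoff`, in place.
function toy_truncate!(psum::Dict{String,Float64}, cutoff::Real)
    filter!(term -> abs(term.second) >= cutoff, psum)
    return psum
end

toy_truncate!(toy_psum, 1e-3)
# Only "ZZI" and "XYI" survive; the two terms below the cutoff are gone.
```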

Run the circuit

In [10]:
@time ourSWAP_psum = propagate(ourSWAP_circuit, pstr, thetas; min_abs_coeff=min_abs_coeff)

  0.773624 seconds (3.36 M allocations: 168.195 MiB, 1.74% gc time, 96.17% compilation time)


PauliSum(nqubits: 25, 13198 Pauli terms:
 -0.013537 * IIZZIZXXIIIIYZIIIZII...
 0.0025156 * IIIIIIXYIIIIZIIIIXZI...
 -0.0016705 * IIIIIIYYXIIXYZIIZZII...
 0.002661 * IIIZIIZYYIIZYXIIIZZI...
 -0.0027996 * IIXZIIYYIIIIYZIIIYZI...
 -0.001725 * IIZZIIYYZIIIZYIIIIZI...
 0.0013354 * IIZZIYYYIIIIXIIIIXII...
 0.001438 * IIIIIIXYZIIZIYIIIIZZ...
 0.0015348 * IZIIIIXIIIIXYIIIIXZI...
 -0.0018285 * IZIIIZYIIIIYYIIIIYZI...
 0.0034953 * IIIIIIZYIIIIZIIIIZII...
 0.0027765 * IIIIIIYIZIIXXXZIZYZI...
 -0.0010834 * IIZIIZXYZIIIYYZIIZZZ...
 -0.0030723 * IIZZIYYZZIIIYYIIIYII...
 -0.0011065 * IZIZIIXZIIIIYZIIIXZI...
 0.0025227 * IIXIIZZIIIIZXIIIIYZI...
 0.001622 * IZIIIYYYIIIXXIIIZZII...
 -0.0020525 * IIIIIYYYIIIXYIIIZZII...
 -0.0025499 * IIIIIIXIIIIXXIIIZXZI...
 -0.0066057 * IIYZIZZIIIIZXIIIIZII...
  ⋮)

Overlap with the zero-state:

In [11]:
overlapwithzero(ourSWAP_psum)

0.11159958430219122
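The expectation value in the all-zero state has a simple structure: $\langle 0 | P | 0 \rangle$ equals 1 if the Pauli string $P$ contains only `I` and `Z`, and 0 otherwise, so the overlap is the sum of the coefficients of those strings. A toy sketch with string keys (the helpers `contributes` and `toy_overlapwithzero` are hypothetical, not the library routine):

```julia
# A Pauli string contributes to <0|...|0> only if every site is I or Z.
contributes(pauli_string::AbstractString) = all(c -> c == 'I' || c == 'Z', pauli_string)

# Sum the coefficients of all contributing strings.
function toy_overlapwithzero(psum::Dict{String,Float64})
    return sum((coeff for (p, coeff) in psum if contributes(p)); init=0.0)
end

toy_sum = Dict("ZIZ" => 0.5, "IXZ" => 0.3, "III" => -0.25, "YYI" => 0.1)
toy_overlap = toy_overlapwithzero(toy_sum)   # only "ZIZ" and "III" contribute
```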

We already mentioned that `PauliPropagation.jl` contains a `CliffordGate` implementation of SWAP. Let's build the same circuit with it and compare performance.

In [12]:
cliffSWAP_circuit = Gate[]
append!(cliffSWAP_circuit, trotter_layer)
for _ in 2:nl
    append!(cliffSWAP_circuit, (CliffordGate(:SWAP, pair) for pair in topology))
    append!(cliffSWAP_circuit, trotter_layer)
end

In [13]:
@time cliffSWAP_psum = propagate(cliffSWAP_circuit, pstr, thetas; min_abs_coeff=min_abs_coeff);

  0.083717 seconds (78.94 k allocations: 5.705 MiB, 63.15% compilation time)


Are the results the same?

In [14]:
overlapwithzero(cliffSWAP_psum)

0.11159958430219122

In [15]:
cliffSWAP_psum == ourSWAP_psum

true

Yes!

We can also benchmark the performance.

In [16]:
using BenchmarkTools

In [17]:
@btime propagate($ourSWAP_circuit, $pstr, $thetas; min_abs_coeff=$min_abs_coeff);

  28.782 ms (2513 allocations: 1.90 MiB)


In [18]:
@btime propagate($cliffSWAP_circuit, $pstr, $thetas; min_abs_coeff=$min_abs_coeff);

  29.632 ms (2513 allocations: 1.90 MiB)


No downside at all from defining our custom gate. How? This is because the `apply` function for this gate is *type stable*! Type stability is absolutely crucial in Julia, and codes live and die by it.

In [19]:
@code_warntype apply(CustomSWAPGate((1, 1)), pstr.term, 1.0)

MethodInstance for apply(::CustomSWAPGate, ::PauliPropagation.UInt56, ::Float64)
  from apply(gate::CustomSWAPGate, pstr, coeff; kwargs...) @ Main In[3]:1
Arguments
  #self#::Core.Const(PauliPropagation.PropagationBase.apply)
  gate::CustomSWAPGate
  pstr::PauliPropagation.UInt56
  coeff::Float64
Body::Tuple{Tuple{PauliPropagation.UInt56, Float64}}
1 ─ %1 = Main.:(var"#apply#1")::Core.Const(Main.var"#apply#1")
│   %2 = Core.NamedTuple()::Core.Const(NamedTuple())
│   %3 = Base.pairs(%2)::Core.Const(Base.Pairs{Symbol, Union{}, Tuple{}, @NamedTuple{}}())
│   %4 = (%1)(%3, #self#, gate, pstr, coeff)::Tuple{Tuple{PauliPropagation.UInt56, Float64}}
└──      return %4



All blue (every inferred type is concrete) means that everything is great! A correctly implemented `apply` is type stable when it always returns the same, statically known number of Pauli string and coefficient pairs. Here that number is 1 because SWAP is a Clifford gate.

### A gate that branches into more than one Pauli string

Now for an example of a gate that can _split_ a Pauli string into two: the `T` gate.

In [20]:
struct CustomTGate <: StaticGate
    qind::Int
end

A `T` gate is a non-Clifford gate that commutes with `I` and `Z`, splits `X` into `cos(π/4)X - sin(π/4)Y`, and `Y` into `cos(π/4)Y + sin(π/4)X` (in the Heisenberg picture). 
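These rules can be checked numerically with plain 2×2 matrices. The sketch below uses the convention $T = \mathrm{diag}(1, e^{i\pi/4})$ and evolves each Pauli as $T^\dagger P T$; both the convention and the helper name `heisenberg` are choices made for this illustration.

```julia
using LinearAlgebra

# Single-qubit Pauli matrices and the T gate, T = diag(1, exp(iπ/4)).
pauliX = ComplexF64[0 1; 1 0]
pauliY = ComplexF64[0 -im; im 0]
pauliZ = ComplexF64[1 0; 0 -1]
Tmat = ComplexF64[1 0; 0 exp(im * π / 4)]

# Heisenberg-picture evolution of an operator P under T.
heisenberg(P) = Tmat' * P * Tmat

c, s = cos(π / 4), sin(π / 4)
x_ok = heisenberg(pauliX) ≈ c * pauliX - s * pauliY   # X -> cos(π/4) X - sin(π/4) Y
y_ok = heisenberg(pauliY) ≈ c * pauliY + s * pauliX   # Y -> cos(π/4) Y + sin(π/4) X
z_ok = heisenberg(pauliZ) ≈ pauliZ                    # Z is left unchanged
```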

Let's write the code for that.

In [21]:
function PauliPropagation.apply(gate::CustomTGate, pstr, coeff; kwargs...)
    # get the Pauli on the site `gate.qind`
    pauli = getpauli(pstr, gate.qind)
    
    if pauli == 0 || pauli == 3  # I or Z commute
        # return a tuple of one (pstr, coeff) tuple
        return tuple((pstr, coeff))     
    end
    
    if pauli == 1 # X goes to X, -Y
        new_pauli = 2  # Y
        # set the Pauli
        new_pstr = setpauli(pstr, new_pauli, gate.qind)
        # adapt the coefficients
        new_coeff = -1 * coeff * sin(π/4)
        
    else # Y goes to Y, X
        new_pauli = 1  # X
        # set the Pauli
        new_pstr = setpauli(pstr, new_pauli, gate.qind)
        # adapt the coefficients
        new_coeff = coeff * sin(π/4)
    end

    updated_coeff = coeff * cos(π/4)

    # return a tuple of two (pstr, coeff) tuples
    return tuple((pstr, updated_coeff), (new_pstr, new_coeff))
    
end

Insert a layer of `CustomTGate`s after each Trotter layer.

In [22]:
ourT_circuit = Gate[]
for _ in 1:nl
    append!(ourT_circuit, trotter_layer)
    append!(ourT_circuit, (CustomTGate(qind) for qind in 1:nq)) 
end

And run:

In [23]:
@time ourT_psum = propagate(ourT_circuit, pstr, thetas; min_abs_coeff=min_abs_coeff)

  0.084521 seconds (90.52 k allocations: 6.723 MiB, 47.04% compilation time)


PauliSum(nqubits: 25, 15787 Pauli terms:
 -0.0032424 * IXIIIIZXXIIZXZIIIZII...
 0.0037487 * IZIIIYXZIIIXXIIIZZII...
 -0.0018537 * IXIIIIYIIIIZXZIIIYZI...
 -0.0037346 * IZIIIYXYIIIXYZIIZZII...
 0.0013272 * IYIIIYYXXIIIYZIIIZII...
 0.0010339 * IIIIIIZIZZIIYXYIIYIZ...
 0.0011072 * IZIIIIXIIIIXYIIIIXZI...
 -0.0010679 * IIIZIIZIYIIIYZZIIZII...
 -0.0010275 * IXZIIYXYZIIZXZIIIZII...
 -0.0017435 * IIIIIIYIZZIYZYYIZIZZ...
 -0.0013191 * IZIIIZYIIIIYYIIIIYZI...
 0.0019275 * IIIIIIYIZIIXXXZIZYZI...
 -0.0011318 * IYZIIYXYYIIZYYZIIZZI...
 0.0023645 * IIIIIZYYIIIZZIIIIIII...
 -0.0011432 * IXIIIIIYIIIIYIIIIZII...
 0.002118 * IYIZIIZYXIIZXXIIIZZI...
 -0.0013364 * IIIIIZXIIIIIXXIIIYII...
 -0.0011092 * IIIIIZZZIIIZYIIIIYII...
 0.0010239 * IYXZIZYYIIIIYZIIIZII...
 0.0012344 * IYIIIYXZIIIIXIIIIYZI...
  ⋮)

In [24]:
overlapwithzero(ourT_psum)

0.3146654070299997

But did it work? Again, we have an implementation of a `TGate` in our library. In case you are interested, we currently implement `T` gates as Pauli `Z` rotations at an angle of `π/4`. Let's compare to that.

In [25]:
libraryT_circuit = Gate[]
for _ in 1:nl
    append!(libraryT_circuit, trotter_layer)
    append!(libraryT_circuit, (TGate(qind) for qind in 1:nq)) 
end

If you call `PauliGate(:Z, qind, parameter)`, this will create a so-called `FrozenGate` wrapping the parametrized `PauliGate`, with a fixed `parameter` at the time of circuit construction.

Run it and compare

In [26]:
@time libraryT_psum = propagate(libraryT_circuit, pstr, thetas; min_abs_coeff=min_abs_coeff);

  0.054624 seconds (10.96 k allocations: 2.714 MiB, 12.29% gc time, 12.67% compilation time)


In [27]:
overlapwithzero(libraryT_psum)

0.3146654070299997

In [28]:
libraryT_psum == ourT_psum

true

It works! But is it optimal?

In [29]:
using BenchmarkTools

In [30]:
@btime propagate($ourT_circuit, $pstr, $thetas; min_abs_coeff=$min_abs_coeff);

  65.116 ms (2509 allocations: 2.29 MiB)


In [31]:
@btime propagate($libraryT_circuit, $pstr, $thetas; min_abs_coeff=$min_abs_coeff);

  62.681 ms (3564 allocations: 2.34 MiB)


Not quite: our version is about 4% slower (65.1 ms versus 62.7 ms), because `apply` for the `CustomTGate` is not type stable.

In [32]:
@code_warntype apply(CustomTGate(1), pstr.term, 0.0)

MethodInstance for apply(::CustomTGate, ::PauliPropagation.UInt56, ::Float64)
  from apply(gate::CustomTGate, pstr, coeff; kwargs...) @ Main In[21]:1
Arguments
  #self#::Core.Const(PauliPropagation.PropagationBase.apply)
  gate::CustomTGate
  pstr::PauliPropagation.UInt56
  coeff::Float64
Body::Union{Tuple{Tuple{PauliPropagation.UInt56, Float64}}, Tuple{Tuple{PauliPropagation.UInt56, Float64}, Tuple{PauliPropagation.UInt56, Float64}}}
1 ─ %1 = Main.:(var"#apply#6")::Core.Const(Main.var"#apply#6")
│   %2 = Core.NamedTuple()::Core.Const(NamedTuple())
│   %3 = Base.pairs(%2)::Core.Const(Base.Pairs{Symbol, Union{}, Tuple{}, @NamedTuple{}}())
│   %4 = (%1)(%3, #self#, gate, pstr, coeff)::Union{Tuple{Tuple{PauliPropagation.UInt56, Float64}}, Tuple{Tuple{PauliPropagation.UInt56, F

`apply` returns either a tuple of one pair, `Tuple{Tuple{UInt56, Float64}}`, or a tuple of two pairs, `Tuple{Tuple{UInt56, Float64}, Tuple{UInt56, Float64}}`, depending on the Pauli it finds at run time, so the inferred return type is a `Union`. Yellow `@code_warntype` output means it might be okay (the gate is not that much slower, after all), but be wary of red. When you hit such an instability, you may want to overload a function one level above `apply()` for optimal performance.
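As an aside, the same diagnosis can be made programmatically by asking the compiler for the inferred return type and checking whether it is concrete. The two toy functions below are hypothetical stand-ins that only mimic the return shapes of a SWAP-like and a T-like `apply`; they do not call the library, and `UInt64` is used purely for convenience.

```julia
# Always returns exactly one (pstr, coeff) pair: a single concrete return type.
stable_apply(pstr::UInt64, coeff::Float64) = ((pstr, coeff),)

# Returns one pair or two pairs depending on a runtime value: a Union return type.
function unstable_apply(pstr::UInt64, coeff::Float64)
    if iseven(pstr)
        return ((pstr, coeff),)
    end
    return ((pstr, coeff * cos(π / 4)), (pstr + one(pstr), coeff * sin(π / 4)))
end

# Inferred return type of `f` for the argument types (UInt64, Float64).
inferred(f) = only(Base.return_types(f, (UInt64, Float64)))

stable_type = inferred(stable_apply)       # Tuple{Tuple{UInt64, Float64}}
unstable_type = inferred(unstable_apply)   # not a single concrete tuple type
is_stable = (isconcretetype(stable_type), isconcretetype(unstable_type))
```

A check like `isconcretetype` is convenient in a test suite, where reading colored `@code_warntype` output is not an option.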

To avoid such type instabilities, we can overload the higher-level function `applytoall!()`. This is also useful for a second reason: the runtime of the T-gate simulation is dominated by commuting Pauli strings (the Pauli `I` is very common for local observables), and we would like to leave those strings where they are, in their original Pauli sum. `applytoall!()` differs from `apply()` in that you write the loop over the Pauli strings in the Pauli sum yourself, which lets you use the old Pauli sum more flexibly. You will receive an `AbstractPauliPropagationCache` object (here a `PauliPropagationCache`) that wraps two `PauliSum` objects. Our convention is that anything left in the main `psum` or the auxiliary `aux_psum` is later merged back into `psum`. Thus, we can simply skip the commuting Pauli strings and edit the coefficients of the remaining Pauli strings in place. See this version of the function:

In [33]:
function PauliPropagation.applytoall!(gate::CustomTGate, prop_cache::AbstractPauliPropagationCache; kwargs...)
    psum = mainsum(prop_cache)
    aux_psum = auxsum(prop_cache)
    
    for (pstr, coeff) in psum 
        # the content of the previous apply() function:
        pauli = getpauli(pstr, gate.qind)

        if pauli == 0 || pauli == 3  # I or Z commute
            # do nothing
            continue
        end

        if pauli == 1 # X goes to X, -Y
            new_pauli = 2  # Y
            new_pstr = setpauli(pstr, new_pauli, gate.qind)
            new_coeff = -1 * coeff * sin(π/4)
        else # Y goes to Y, X
            new_pauli = 1  # X
            new_pstr = setpauli(pstr, new_pauli, gate.qind)
            new_coeff = coeff * sin(π/4)
        end

        updated_coeff = coeff * cos(π/4)

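        # use add!() if a Pauli sum may already contain the string and coefficients must accumulate;
        # here set!() suffices: it overwrites the coefficient of `pstr` in `psum`, and `new_pstr`
        # is not yet in `aux_psum`, which gives slightly better performance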
        set!(psum, pstr, updated_coeff)
        set!(aux_psum, new_pstr, new_coeff)
    end
    return prop_cache
end
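The bookkeeping in `applytoall!` can be mimicked with two plain dictionaries. The following toy sketch is not the library's cache: `toy_applyT!` and `toy_merge!` are hypothetical, single characters stand in for one-qubit Pauli strings, and the final merge is only assumed to behave like an additive merge of the auxiliary sum into the main one.

```julia
# Toy single-qubit version of the in-place T gate update.
# `main` holds the current terms; `aux` collects newly created terms.
function toy_applyT!(main::Dict{Char,Float64}, aux::Dict{Char,Float64})
    c, s = cos(π / 4), sin(π / 4)
    for (p, coeff) in collect(main)          # iterate over a snapshot of the terms
        (p == 'I' || p == 'Z') && continue   # commuting terms stay untouched in `main`
        main[p] = coeff * c                  # rescale the existing term in place
        if p == 'X'
            aux['Y'] = -coeff * s            # X -> cos(π/4) X - sin(π/4) Y
        else
            aux['X'] = coeff * s             # Y -> cos(π/4) Y + sin(π/4) X
        end
    end
    return main, aux
end

# Assumed merge step: add everything in `aux` onto `main`, then empty `aux`.
function toy_merge!(main::Dict{Char,Float64}, aux::Dict{Char,Float64})
    mergewith!(+, main, aux)
    empty!(aux)
    return main
end

toy_main = Dict('X' => 1.0, 'Z' => 0.5)
toy_aux = Dict{Char,Float64}()
toy_applyT!(toy_main, toy_aux)
toy_merge!(toy_main, toy_aux)
# toy_main now holds Z (unchanged), X scaled by cos(π/4), and a new Y term.
```

The commuting `Z` term is never copied or rewritten, which is the saving that the in-place formulation is after.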

In [34]:
@time ourT_psum2 = propagate(ourT_circuit, pstr, thetas; min_abs_coeff=min_abs_coeff);

  0.107546 seconds (37.54 k allocations: 4.003 MiB, 41.89% compilation time: 100% of which was recompilation)


In [35]:
overlapwithzero(ourT_psum2)

0.3146654070299997

In [36]:
ourT_psum == ourT_psum2

true

And check the performance. It is equivalent for all intents and purposes.

In [37]:
@btime propagate($ourT_circuit, $pstr, $thetas; min_abs_coeff=$min_abs_coeff);

  61.369 ms (2514 allocations: 2.30 MiB)


In [38]:
@btime propagate($libraryT_circuit, $pstr, $thetas; min_abs_coeff=$min_abs_coeff);

  62.355 ms (3564 allocations: 2.34 MiB)


Enjoy defining custom and high-performance gates! 