# Available backends and their usage

CUDA Quantum provides many backends. They enable seamless switching between GPUs, QPUs, and CPUs, and also allow for workflows involving multiple architectures working in tandem.

In [1]:
import cudaq

targets = cudaq.get_targets() 

# for t in targets: 
#     print(t)

- **default**: The default qpp-based CPU backend, which is multithreaded to maximise the usage of available cores on your system.

- **nvidia**: GPU based backend which accelerates quantum circuit simulation on NVIDIA GPUs powered by cuQuantum.

- **nvidia-mqpu**: Enables users to program workflows that use multiple quantum processors, emulated today on GPUs.

- **nvidia-mgpu**: Allows for scaling circuit simulation beyond what is feasible with any QPU today. 

- **density-matrix-cpu**: Noisy simulations via density matrix calculations. CPU only for now with GPU support coming soon. 

Below we explore some of the workflows made possible by these backends.

In [3]:
# Let's define a function that generates an n-qubit GHZ state

import cudaq

def ghz_state(n_qubits, target): 

    cudaq.set_target(target)

    kernel = cudaq.make_kernel()

    qubits = kernel.qalloc(n_qubits)

    kernel.h(qubits[0])

    for i in range(1, n_qubits):
        kernel.cx(qubits[0], qubits[i])
        
    kernel.mz(qubits)

    result = cudaq.sample(kernel, shots_count = 1000)

    return result 


# Default CPU backend 



In [3]:
cpu_result = ghz_state(n_qubits = 2, target = 'default')

cpu_result.dump()

{ 00:468 11:532 }
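
The roughly even split between `00` and `11` is what the GHZ state $(|0\dots0\rangle + |1\dots1\rangle)/\sqrt{2}$ predicts. As a minimal sketch that does not depend on CUDA Quantum, the same statistics can be reproduced with plain NumPy. The helper name `sample_ghz` and the seed are illustrative assumptions and are not part of the `cudaq` API.

```python
import numpy as np
from collections import Counter

def sample_ghz(n_qubits, shots=1000, seed=0):
    """Sample bitstrings from an ideal n-qubit GHZ state (hypothetical helper)."""
    # Only the all-zeros and all-ones basis states have nonzero amplitude.
    state = np.zeros(2**n_qubits, dtype=np.complex128)
    state[0] = state[-1] = 1 / np.sqrt(2)
    probabilities = np.abs(state)**2
    rng = np.random.default_rng(seed)
    outcomes = rng.choice(len(state), size=shots, p=probabilities)
    # Format each sampled basis index as an n-bit string.
    return Counter(format(int(k), f"0{n_qubits}b") for k in outcomes)

counts = sample_ghz(2)
print(dict(counts))
```

Only the two bitstrings `00` and `11` should ever appear, with counts that fluctuate around half the number of shots.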


# Acceleration via NVIDIA GPUs

Executing the circuit below on an NVIDIA GPU can be around **200x faster** than executing it on a CPU.

In [4]:
gpu_result = ghz_state(n_qubits = 25, target = 'nvidia')

gpu_result.dump()

{ 0000000000000000000000000:512 1111111111111111111111111:488 }


# Multiple NVIDIA GPUs

An $n$-qubit quantum state has $2^n$ complex amplitudes, each of which requires 8 bytes of memory to store. Hence the total memory required to store an $n$-qubit quantum state is $8$ bytes $\times 2^n$. For $n = 30$ qubits, this is roughly $8.6$ GB, but for $n = 40$ it grows to roughly $8800$ GB.

As the qubit count of a circuit is increased, we eventually reach a limit where the memory required is beyond the capabilities of a single GPU. The `nvidia-mgpu` target pools the memory of additional GPUs, allowing qubit counts to be scaled further.
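
The arithmetic above can be sketched in a few lines of Python. The function names are illustrative, the 8 bytes per amplitude follows the text, and the estimate ignores any workspace the simulator needs beyond the state vector itself, so practical limits will be somewhat lower.

```python
def statevector_bytes(n_qubits, bytes_per_amplitude=8):
    """Memory needed to hold 2**n_qubits amplitudes."""
    return bytes_per_amplitude * 2**n_qubits

def max_qubits(memory_bytes, bytes_per_amplitude=8):
    """Largest n whose state vector fits in memory_bytes (state vector only)."""
    amplitudes = memory_bytes // bytes_per_amplitude
    return amplitudes.bit_length() - 1

for n in (30, 34, 40):
    print(f"{n} qubits: {statevector_bytes(n) / 1e9:,.1f} GB")

single_gpu = 80 * 10**9  # one 80 GB GPU, decimal gigabytes assumed
print(max_qubits(single_gpu), max_qubits(4 * single_gpu))
```

Under these assumptions a 34-qubit state vector does not fit on one 80 GB GPU but does fit in the pooled memory of four.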



In [5]:
# The nvidia-mgpu backend allows qubit counts to scale beyond what is feasible on a single GPU

# This is executed on a node with 4 A100 GPUs with 80 GB of memory each

# mgpu_result = ghz_state(n_qubits = 34, target = 'nvidia-mgpu')

# mgpu_result.dump()

# Multiple QPUs

The `nvidia-mqpu` backend lets users develop workflows for future multi-QPU systems today, with the QPUs simulated on GPUs.


## Asynchronous data collection via batching Hamiltonian terms



<img src="hsplit.png" alt="Alt Text" width="500" height="200">


In [6]:
import cudaq
from cudaq import spin

n_qubits = 10
n_terms = 100000  

cudaq.set_target('nvidia-mqpu')

kernel = cudaq.make_kernel()

qubits = kernel.qalloc(n_qubits)

kernel.h(qubits[0])

for i in range(1, n_qubits):
    kernel.cx(qubits[0], qubits[i])
    
# We create a random Hamiltonian with 1e5 (100,000) terms

hamiltonian = cudaq.SpinOperator.random(n_qubits, n_terms)   

# The observe call calculates the expectation value of the Hamiltonian; it automatically batches the terms and distributes them over the multiple QPUs/GPUs

exp_val = cudaq.observe(kernel, hamiltonian)

exp_val.expectation_z()


-9.827666697940636e-05

## Asynchronous data collection via circuit batching

<img src="circsplit.png" alt="Alt Text" width="500" height="200">


In [7]:
import cudaq
from cudaq import spin
import numpy as np
np.random.seed(1)

cudaq.set_target('nvidia-mqpu')

n_qubits = 5
n_samples = 1000
h = spin.z(0) 
n_parameters = n_qubits

#Below we run a circuit for 1000 different input parameters 
parameters = np.random.default_rng(13).uniform(low=0, high=1, size = (n_samples,n_parameters))

kernel, params = cudaq.make_kernel(list)

qubits = kernel.qalloc(n_qubits)
qubits_list = list(range(n_qubits))

for i in range(n_qubits):
    kernel.rx(params[i], qubits[i])


In [8]:
%timeit result = cudaq.observe_n(kernel, h, parameters)  #observe_n allows for parameter broadcasting 

3.35 s ± 160 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
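
For this particular kernel the expectation values can also be worked out by hand: each qubit starts in $|0\rangle$ and receives a single `rx` rotation, so $\langle Z_0 \rangle = \cos(\theta_0)$, where $\theta_0$ is the first parameter in each row. The NumPy sketch below regenerates the same parameter array and computes these reference values. Using them to validate the simulator output is a suggestion, not something the original workflow does.

```python
import numpy as np

n_qubits = 5
n_samples = 1000

# Same generator and seed as the cell that builds `parameters`.
parameters = np.random.default_rng(13).uniform(low=0, high=1, size=(n_samples, n_qubits))

# rx(theta) on |0> gives <Z> = cos(theta); h = Z(0) only involves qubit 0.
reference = np.cos(parameters[:, 0])
print(reference.shape, reference.min(), reference.max())
```

Since the angles are drawn from $[0, 1)$, every reference value lies between $\cos(1)$ and $1$.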


In [9]:
print(parameters.shape)

xi = np.split(parameters, 4)  #We split our parameters into 4 arrays since we have 4 GPUs available 

print(len(xi))

print(xi[0].shape, xi[1].shape, xi[2].shape, xi[3].shape)

(1000, 5)
4
(250, 5) (250, 5) (250, 5) (250, 5)
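
`np.split` raises an error when the number of rows is not divisible by the number of chunks. If the GPU count does not divide the sample count evenly, `np.array_split` is a drop-in alternative that permits uneven chunks. A small sketch, where `n_gpus = 3` is an illustrative assumption:

```python
import numpy as np

parameters = np.random.default_rng(13).uniform(low=0, high=1, size=(1000, 5))
n_gpus = 3  # hypothetical device count that does not divide 1000

# array_split tolerates uneven division; earlier chunks get the extra rows.
chunks = np.array_split(parameters, n_gpus)
print([c.shape for c in chunks])
```

Each chunk can then be dispatched to its own `qpu_id` in the same way as the evenly split arrays.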


In [10]:
%%timeit 
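# Timing the execution on 4 GPUs instead of a single GPU; with 4 GPUs a speedup of up to roughly 4x is expected.
# Note: this loop only times the submission of the asynchronous jobs. Call .get() on each entry of asyncresults to wait for and retrieve its result.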

asyncresults = []

for i in range(len(xi)):
    for j in range(xi[i].shape[0]):
        asyncresults.append(cudaq.observe_async(kernel, h, xi[i][j,:], qpu_id = i))

93 ms ± 341 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


## Noisy simulations 

Quantum noise can be characterised into coherent and incoherent sources of error that arise during a computation. Coherent noise is commonly due to systematic errors originating from device miscalibrations, for example, gates implementing a rotation $\theta + \epsilon$ instead of $\theta$.
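
To make the over-rotation example concrete, the sketch below builds the standard single-qubit $R_x$ matrix in NumPy and compares an ideal $R_x(\pi)$ with a miscalibrated $R_x(\pi + \epsilon)$. The value $\epsilon = 0.1$ and the helper name `rx` are illustrative assumptions.

```python
import numpy as np

def rx(theta):
    """Matrix of a rotation by theta about the X axis."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])

theta, epsilon = np.pi, 0.1  # epsilon is an assumed calibration error
zero = np.array([1.0, 0.0], dtype=np.complex128)

# Probability of measuring |1> after the ideal and the miscalibrated gate.
p1_ideal = abs((rx(theta) @ zero)[1])**2
p1_faulty = abs((rx(theta + epsilon) @ zero)[1])**2
print(p1_ideal, p1_faulty)

# The miscalibrated gate is still unitary: U^dagger U = I.
faulty = rx(theta + epsilon)
print(np.allclose(faulty.conj().T @ faulty, np.eye(2)))
```

The miscalibrated gate lowers the probability of reaching $|1\rangle$, yet it remains a unitary operation on a pure state, which is what makes this kind of error coherent.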

Incoherent noise has its origins in quantum states being entangled with the environment due to decoherence. This leads to mixed states which are probability distributions over pure states and are described by employing the density matrix formalism. 
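
One way to tell a pure state from a mixed one is the purity $\mathrm{Tr}(\rho^2)$, which equals 1 for pure states and is smaller otherwise. A minimal NumPy sketch, with the two example states chosen purely for illustration:

```python
import numpy as np

# Pure state |+> = (|0> + |1>)/sqrt(2) written as a density matrix.
plus = np.array([1.0, 1.0]) / np.sqrt(2)
rho_pure = np.outer(plus, plus.conj())

# Equal classical mixture of |0> and |1>.
rho_mixed = 0.5 * np.diag([1.0, 0.0]) + 0.5 * np.diag([0.0, 1.0])

def purity(rho):
    """Tr(rho^2): 1 for a pure state, less than 1 for a mixed state."""
    return float(np.real(np.trace(rho @ rho)))

print(purity(rho_pure), purity(rho_mixed))
```

Both states give a 50/50 outcome when measured in the computational basis, but only the density matrix captures that the second one carries no coherence between $|0\rangle$ and $|1\rangle$.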

We can model incoherent noise via quantum channels, which are linear, completely positive, trace-preserving maps. They are described in the language of Kraus operators $\{ K_i \}$, which satisfy the condition $\sum_{i} K_i^\dagger K_i = \mathbb{I}$.

The bit flip operation flips the qubit with probability $p$ and leaves it unchanged with probability $1-p$. This can be represented by employing Kraus operators: 


$K_0 = \sqrt{1-p} \begin{pmatrix}
  1 & 0 \\
  0 & 1
\end{pmatrix} $


$K_1 = \sqrt{p} \begin{pmatrix}
  0 & 1 \\
  1 & 0
\end{pmatrix} $
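
The completeness condition, and the action of the channel on a density matrix, $\rho \mapsto \sum_i K_i \rho K_i^\dagger$, can be checked numerically. The sketch below uses plain NumPy with the two bit-flip Kraus operators given above; $p = 0.1$ is an assumed error probability.

```python
import numpy as np

p = 0.1  # assumed bit-flip probability
k0 = np.sqrt(1 - p) * np.eye(2)
k1 = np.sqrt(p) * np.array([[0.0, 1.0], [1.0, 0.0]])
kraus_ops = [k0, k1]

# Completeness: sum_i K_i^dagger K_i = I.
completeness = sum(k.conj().T @ k for k in kraus_ops)
print(np.allclose(completeness, np.eye(2)))

# Apply the channel to rho = |1><1|.
rho = np.array([[0.0, 0.0], [0.0, 1.0]])
rho_out = sum(k @ rho @ k.conj().T for k in kraus_ops)
print(np.real(np.diag(rho_out)))  # populations of |0> and |1>
```

Starting from $|1\rangle\langle 1|$, the channel moves a fraction $p$ of the population to $|0\rangle$ and leaves $1-p$ in $|1\rangle$, while the trace stays equal to 1.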

Let's implement this using CUDA Quantum.

In [1]:
import cudaq
import numpy as np

cudaq.set_target('density-matrix-cpu')

# Let's define a circuit
n_qubits = 2
kernel = cudaq.make_kernel()
q = kernel.qalloc(n_qubits)
kernel.x(q[0])
kernel.x(q[1])

#In the ideal noiseless case, we get 11 100% of the time as expected 
ideal_counts = cudaq.sample(kernel, shots_count=1000)
ideal_counts.dump()


#You can build your own Kraus channels 
p = 0.1 #probability of error 

k0 = np.sqrt(1-p) * np.array([[1.0, 0.0], [0.0, 1.0]], dtype=np.complex128)
k1 = np.sqrt(p) * np.array([[0.0, 1.0], [1.0, 0.0]], dtype=np.complex128)

bitflip = cudaq.KrausChannel([k0, k1])

#You can also use built in noise channels 
depol = cudaq.DepolarizationChannel(p)

#Add the noise models
noise = cudaq.NoiseModel()
noise.add_channel("x", [0], depol)
noise.add_channel("x", [1], bitflip)

#We see unwanted results due to the effects of the noise channels 
noisy_counts = cudaq.sample(kernel, noise_model=noise, shots_count=1000)
noisy_counts.dump()


{ 11:1000 }
{ 11:713 10:154 01:114 00:19 }


# Running on hardware

CUDA Quantum can efficiently target diverse quantum computing architectures, including superconducting circuits, ion traps, neutral atoms, diamond-based, photonic systems, and more. 

The nvq++ compiler automatically compiles and executes the program on the designated architecture.

We have already announced hardware integrations with major QPU providers and are working on bringing them online.



In [None]:
#Switching between CPUs, GPUs and QPUs is as easy as changing a string input 

cpu_result = ghz_state(n_qubits = 2, target = 'default')
gpu_result = ghz_state(n_qubits = 2, target = 'nvidia')
# qpu_result = ghz_state(n_qubits = 2, target = 'quantinuum')

