In [None]:
# This cell is added by sphinx-gallery
# It can be customized to whatever you like
%matplotlib inline

# Kernel-based training of quantum models with scikit-learn

Over the last few years, quantum machine learning research has yielded
many insights into how we can understand and train quantum circuits as
machine learning models. While many connections to neural networks have
been made, it has become increasingly clear that their mathematical
foundation is intimately related to so-called *kernel methods*, the most
famous of which is the [support vector machine
(SVM)](https://en.wikipedia.org/wiki/Support-vector_machine) (see for
example [Schuld and Killoran (2018)](https://arxiv.org/abs/1803.07128),
[Havlicek et al. (2018)](https://arxiv.org/abs/1804.11326), [Liu et al.
(2020)](https://arxiv.org/abs/2010.02174), [Huang et al.
(2020)](https://arxiv.org/pdf/2011.01938), and, for a systematic summary
which we will follow here, [Schuld
(2021)](https://arxiv.org/abs/2101.11020)).

The link between quantum models and kernel methods has important
practical implications: we can replace the common [variational
approach](https://pennylane.ai/qml/glossary/variational_circuit) to
quantum machine learning with a classical kernel method where the
kernel---a small building block of the overall algorithm---is computed
by a quantum device. In many situations there are guarantees that we get
better or at least equally good results.

This demonstration explores how kernel-based training compares with
[variational
training](https://pennylane.ai/qml/demos/tutorial_variational_classifier)
in terms of the number of quantum circuits that have to be evaluated.
For this we train a quantum machine learning model with a kernel-based
approach using a combination of PennyLane and the
[scikit-learn](https://scikit-learn.org/) machine learning library. We
compare this strategy with a variational quantum circuit trained via
stochastic gradient descent using
[PyTorch](https://pennylane.readthedocs.io/en/stable/introduction/interfaces/torch.html).

We will see that in a typical small-scale example, kernel-based training
requires only a fraction of the number of quantum circuit evaluations
used by variational circuit training, while each evaluation runs a much
shorter circuit. In general, the relative efficiency of kernel-based
methods compared to variational circuits depends on the number of
parameters used in the variational model.

![](https://blog-assets.cloud.pennylane.ai/demos/tutorial_kernel_based_training/main/_assets/static/demonstration_assets/kernel_based_training/scaling.png)

If the number of variational parameters remains small, e.g., there is a
square-root-like scaling with the number of data samples (green line),
variational circuits are almost as efficient as neural networks (blue
line), and require far fewer circuit evaluations than the quadratic
scaling of kernel methods (red line). However, with current
hardware-compatible training strategies, kernel methods scale much
better than variational circuits that require a number of parameters of
the order of the training set size (orange line).
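
The comparison in the figure can be made concrete with a rough count. The sketch below assumes that kernel-based training evaluates one circuit per pair of training points ($M^2$ circuits, ignoring symmetry) and that variational training uses parameter-shift gradients, which cost two circuit evaluations per parameter and per sample. The number of steps, the batch size, and the data set size are illustrative assumptions, not values taken from the figure.

```python
import math


def kernel_training_circuits(n_samples):
    """One circuit per pair of training points (quadratic scaling)."""
    return n_samples ** 2


def variational_training_circuits(n_params, n_steps, batch_size):
    """Parameter-shift gradients: 2 circuits per parameter, per sample, per step."""
    return 2 * n_params * n_steps * batch_size


n_samples = 1000                 # hypothetical training set size M
n_steps, batch_size = 500, 10    # hypothetical training schedule

few_params = round(math.sqrt(n_samples))  # square-root-like scaling
many_params = n_samples                   # parameters of the order of M

counts = {
    "kernel": kernel_training_circuits(n_samples),
    "variational (sqrt(M) params)": variational_training_circuits(
        few_params, n_steps, batch_size
    ),
    "variational (M params)": variational_training_circuits(
        many_params, n_steps, batch_size
    ),
}

for name, value in counts.items():
    print(f"{name}: {value:,} circuits")
```

Under these assumptions the small variational model needs fewer circuits than the kernel method, while the model with as many parameters as data points needs ten times more.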

In conclusion, **for quantum machine learning applications with many
parameters, kernel-based training can be a great alternative to the
variational approach to quantum machine learning**.

After working through this demo, you will:

-   be able to use a support vector machine with a quantum kernel
    computed with PennyLane, and
-   be able to compare the scaling of quantum circuit evaluations
    required in kernel-based versus variational training.


# Background

Let us consider a *quantum model* of the form

$$f(x) = \langle \phi(x) | \mathcal{M} | \phi(x)\rangle,$$

where $| \phi(x)\rangle$ is prepared by a fixed [embedding
circuit](https://pennylane.ai/qml/glossary/quantum_embedding) that
encodes data inputs $x,$ and $\mathcal{M}$ is an arbitrary observable.
This model includes variational quantum machine learning models, since
the observable can effectively be implemented by a simple measurement
that is preceded by a variational circuit:

![](https://blog-assets.cloud.pennylane.ai/demos/tutorial_kernel_based_training/main/_assets/static/demonstration_assets/kernel_based_training/quantum_model.png)


For example, applying a circuit $G(\theta)$ and then measuring the
Pauli-Z observable $\sigma^0_z$ of the first qubit implements the
trainable measurement
$\mathcal{M}(\theta) = G^{\dagger}(\theta) \sigma^0_z G(\theta).$
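
As a small numerical check of this identity, the sketch below uses a single qubit, with an RX rotation as a stand-in embedding and an RY rotation as a stand-in for $G(\theta).$ Both gate choices and the numerical values are assumptions made for illustration. It compares "apply $G$, then measure $\sigma_z$" with "measure $\mathcal{M}(\theta)$ directly on the embedded state".

```python
import math


def matmul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def dagger(a):
    """Conjugate transpose of a 2x2 matrix."""
    return [[complex(a[j][i]).conjugate() for j in range(2)] for i in range(2)]


def apply(m, v):
    """Apply a 2x2 matrix to a length-2 state vector."""
    return [sum(m[i][j] * v[j] for j in range(2)) for i in range(2)]


def expval(m, v):
    """Expectation value <v|m|v> of a Hermitian 2x2 matrix."""
    mv = apply(m, v)
    return sum(complex(v[i]).conjugate() * mv[i] for i in range(2)).real


def rx(angle):
    c, s = math.cos(angle / 2), math.sin(angle / 2)
    return [[c, -1j * s], [-1j * s, c]]


def ry(angle):
    c, s = math.cos(angle / 2), math.sin(angle / 2)
    return [[c, -s], [s, c]]


Z = [[1, 0], [0, -1]]

x, theta = 0.7, 1.3            # hypothetical input and circuit parameter
phi = apply(rx(x), [1, 0])     # embedded state |phi(x)>
G = ry(theta)

# Option 1: run the variational circuit, then measure Pauli-Z.
f_circuit = expval(Z, apply(G, phi))

# Option 2: measure the trainable observable M(theta) = G^dagger Z G.
M = matmul(dagger(G), matmul(Z, G))
f_observable = expval(M, phi)

print(f_circuit, f_observable)
```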

The main practical consequence of approaching quantum machine learning
with a kernel approach is that instead of training $f$ variationally, we
can often train an equivalent classical kernel method with a kernel
executed on a quantum device. This *quantum kernel* is given by the
mutual overlap of two data-encoding quantum states,

$$\kappa(x, x') = | \langle \phi(x') | \phi(x)\rangle|^2.$$

Kernel-based training therefore bypasses the processing and measurement
parts of common variational circuits, and only depends on the data
encoding.
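
To illustrate that only the encoding enters, here is a minimal sketch of the kernel as a squared state overlap. The one-qubit feature map `feature_state` is a hypothetical example, not the embedding used later in this demo.

```python
import math


def feature_state(x):
    """Hypothetical one-qubit encoding: RX(x)|0> = cos(x/2)|0> - i sin(x/2)|1>."""
    return [complex(math.cos(x / 2)), -1j * math.sin(x / 2)]


def quantum_kernel(state_a, state_b):
    """kappa = |<state_b|state_a>|^2 for two normalized state vectors."""
    overlap = sum(b.conjugate() * a for a, b in zip(state_a, state_b))
    return abs(overlap) ** 2


k_same = quantum_kernel(feature_state(0.4), feature_state(0.4))
k_diff = quantum_kernel(feature_state(0.4), feature_state(1.9))
k_swapped = quantum_kernel(feature_state(1.9), feature_state(0.4))

print(k_same, k_diff, k_swapped)
```

The kernel of a point with itself is 1, and swapping the arguments leaves the value unchanged. No trainable circuit or measurement parameter appears anywhere in the computation.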

If the loss function $L$ is the [hinge
loss](https://en.wikipedia.org/wiki/Hinge_loss), the kernel method
corresponds to a standard [support vector
machine](https://en.wikipedia.org/wiki/Support-vector_machine) (SVM) in
the sense of a maximum-margin classifier. Other convex loss functions
lead to more general variations of support vector machines.
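
For reference, the hinge loss for a single sample is $\max(0, 1 - y f(x))$ with labels $y \in \{-1, 1\}.$ A minimal sketch, with made-up predictions and labels:

```python
def hinge_loss(prediction, label):
    """Hinge loss for one sample; the label is expected to be -1 or +1."""
    return max(0.0, 1.0 - label * prediction)


def mean_hinge_loss(predictions, labels):
    """Average hinge loss over a data set."""
    losses = [hinge_loss(p, y) for p, y in zip(predictions, labels)]
    return sum(losses) / len(losses)


# Hypothetical model outputs f(x) and their true labels.
predictions = [0.8, -0.3, 1.5, -2.0]
labels = [1, 1, 1, -1]

average = mean_hinge_loss(predictions, labels)
print(average)
```

Samples that are classified correctly with a margin of at least 1 contribute nothing, which is what makes the resulting classifier a maximum-margin one.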

> **Note**
>
> More precisely, we can replace variational with kernel-based training
> if the optimization problem can be written as minimizing a cost of the
> form
>
> $$\min_f  \lambda\;  \mathrm{tr}\{\mathcal{M}^2\} + \frac{1}{M}\sum_{m=1}^M L(f(x^m), y^m),$$
>
> which is a regularized empirical risk with training data samples
> $(x^m, y^m)_{m=1\dots M},$ regularization strength
> $\lambda \in \mathbb{R},$ and loss function $L.$
>
> Theory predicts that kernel-based training will always find better or
> equally good minima of this risk. However, to show this here we would
> have to either regularize the variational training by the trace of the
> squared observable, or switch off regularization in the classical SVM,
> which removes a lot of its strength. The kernel-based and the
> variational training in this demonstration therefore optimize slightly
> different cost functions, and it is out of our scope to establish
> whether one training method finds a better minimum than the other.


# Kernel-based training

First, we will turn to kernel-based training of quantum models. As
stated above, an example implementation is a standard support vector
machine with a kernel computed by a quantum circuit.


We begin by importing the libraries and helper functions we need:


In [None]:
import numpy as np

from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

import pennylane as qml
from pennylane.templates import AngleEmbedding

import matplotlib.pyplot as plt

np.random.seed(42)

The second step is to define a data set. Since the performance of the
models is not the focus of this demo, we can just use the first two
classes of the famous [Iris data
set](https://en.wikipedia.org/wiki/Iris_flower_data_set). Dating back as
far as 1936, this toy data set contains 150 samples of four features
each. Restricting it to the first two classes leaves 100 samples and
gives rise to a very simple classification problem.


In [None]:
X, y = load_iris(return_X_y=True)

# pick inputs and labels from the first two classes only,
# corresponding to the first 100 samples
X = X[:100]
y = y[:100]

# scaling the inputs is important since the embedding we use is periodic
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# scaling the labels to -1, 1 is important for the SVM and the
# definition of a hinge loss
y_scaled = 2 * (y - 0.5)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_scaled)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

We use the [angle-embedding
template](https://pennylane.readthedocs.io/en/stable/code/api/pennylane.templates.embeddings.AngleEmbedding.html)
which needs as many qubits as there are features:


In [None]:
n_qubits = len(X_train[0])
n_qubits

To implement the kernel we could prepare the two states
$| \phi(x) \rangle,$ $| \phi(x') \rangle$ on different sets of qubits
with angle-embedding routines $S(x), S(x'),$ and measure their overlap
with a small routine called a [SWAP
test](https://en.wikipedia.org/wiki/Swap_test).

However, we need only half the number of qubits if we prepare
$| \phi(x)\rangle$ and then apply the inverse embedding with $x'$ on the
same qubits. We then measure the projector onto the initial state
$|0..0\rangle \langle 0..0|.$

![](https://blog-assets.cloud.pennylane.ai/demos/tutorial_kernel_based_training/main/_assets/static/demonstration_assets/kernel_based_training/kernel_circuit.png)

To verify that this gives us the kernel, we insert
$\mathcal{M} = |0..0\rangle \langle 0..0|$:

$$\begin{aligned}
    \langle 0..0 |S(x)^{\dagger} S(x') \mathcal{M} S(x')^{\dagger} S(x)  | 0..0\rangle &= \langle 0..0 |S(x)^{\dagger} S(x') |0..0\rangle \langle 0..0| S(x')^{\dagger} S(x)  | 0..0\rangle  \\
    &= |\langle 0..0| S(x')^{\dagger} S(x)  | 0..0\rangle |^2\\
    &= | \langle \phi(x') | \phi(x)\rangle|^2 \\
    &= \kappa(x, x').
\end{aligned}$$
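
The identity can also be checked numerically. The sketch below assumes the embedding is a layer of single-qubit RX rotations, one per feature (to our understanding this is the default behaviour of `AngleEmbedding`), so that every multi-qubit quantity factorizes into per-qubit ones. The feature vectors are made up.

```python
import math


def rx(angle):
    c, s = math.cos(angle / 2), math.sin(angle / 2)
    return [[c, -1j * s], [-1j * s, c]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def prob_all_zeros(x, x_prime):
    """P(0..0) after S(x) followed by the inverse embedding S(x')^dagger."""
    amplitude = 1.0
    for xi, xpi in zip(x, x_prime):
        u = matmul(rx(-xpi), rx(xi))  # RX(-a) is the inverse of RX(a)
        amplitude *= u[0][0]          # <0| RX(-x'_i) RX(x_i) |0>
    return abs(amplitude) ** 2


def overlap_kernel(x, x_prime):
    """|<phi(x')|phi(x)>|^2 computed directly from the product states."""
    overlap = 1.0
    for xi, xpi in zip(x, x_prime):
        phi = [rx(xi)[0][0], rx(xi)[1][0]]        # RX(x_i)|0>
        phi_p = [rx(xpi)[0][0], rx(xpi)[1][0]]    # RX(x'_i)|0>
        overlap *= sum(p.conjugate() * q for p, q in zip(phi_p, phi))
    return abs(overlap) ** 2


x = [0.3, -1.2, 2.0, 0.5]         # hypothetical feature vectors
x_prime = [1.1, 0.4, -0.7, 0.5]

print(prob_all_zeros(x, x_prime), overlap_kernel(x, x_prime))
```

Both numbers agree, and `prob_all_zeros(x, x)` evaluates to 1, as expected for a kernel of a data point with itself.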

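Note that a projector $|0..0 \rangle \langle 0..0|$ can be constructed
using the `qml.Hermitian` observable in PennyLane. Equivalently, its
expectation value is the probability of measuring the all-zeros bit
string, which is the first entry of the vector returned by `qml.probs`.
The code in this demo uses the latter approach.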

Altogether, we use the following quantum node as a *quantum kernel
evaluator*. For this purpose, we will use the `scaleway.aer` device from
the Scaleway plugin for PennyLane. Define the following credentials directly in the cell or set them as environment variables:


In [None]:
import os

project_id = os.getenv("SCW_PROJECT_ID")
secret_key = os.getenv("SCW_SECRET_KEY")
backend_name = "EMU-AER-16C-128M"

print(f"Project ID: {project_id}")
# Never print the secret key itself; only report whether it is set.
print(f"Secret Key set: {secret_key is not None}")
print(f"Backend Name: {backend_name}")

Here we define the PennyLane quantum node along with the chosen device:

In [None]:
dev_kernel = qml.device(
    "scaleway.aer",
    wires=n_qubits,
    project_id=project_id,
    secret_key=secret_key,
    backend=backend_name,
)


@qml.qnode(dev_kernel)
def kernel(x1, x2):
    """The quantum kernel."""
    AngleEmbedding(x1, wires=range(n_qubits))
    qml.adjoint(AngleEmbedding)(x2, wires=range(n_qubits))
    return qml.probs(wires=range(n_qubits))


def kernel_wrapper(x1, x2):
    """x1, x2 can be floats or arrays."""
    return kernel(x1, x2)[..., 0]


qml.draw_mpl(kernel, decimals=1, style="pennylane")(X_train[0], X_train[0])

A good sanity check is whether evaluating the kernel of a data point and
itself returns 1:


In [None]:
kernel_wrapper(X_train[0], X_train[0])

The way an SVM with a custom kernel is implemented in scikit-learn
requires us to pass a function that computes a matrix of kernel
evaluations for samples in two different datasets A, B. If A=B, this is
the [Gram matrix](https://en.wikipedia.org/wiki/Gramian_matrix).

We define our kernel matrix functions in the following cell:

In [None]:
def kernel_matrix(A, B):
    """Compute the matrix whose entries are the kernel
    evaluated on pairwise data from sets A and B."""
    inputs = np.array([[a, b] for a in A for b in B])
    return kernel_wrapper(inputs[:, 0], inputs[:, 1]).reshape(len(A), len(B))


def kernel_matrix_training_optimized(A):
    """
    Optimized version of kernel_matrix for training data.
    We can take advantage of the fact that in this case, we compute the Gram matrix.
    """
    # The Gram matrix is symmetric with a unit diagonal, so we only evaluate
    # the kernel for pairs above the diagonal. This avoids unnecessary circuit
    # calls, making training faster and less costly.

    inputs = []
    for i, a in enumerate(A):
        for j, b in enumerate(A[i:], i):
            if i == j:
                continue
            inputs += [[a, b]]

    inputs = np.array(inputs)
    K_upper = kernel_wrapper(inputs[:, 0], inputs[:, 1])

    K = np.ones((len(A), len(A)))
    K[np.triu_indices_from(K, k=1)] = K_upper
    K.T[np.triu_indices_from(K.T, k=1)] = K_upper
    return K

Quick sanity check on 5 data points:

In [None]:
mat = kernel_matrix_training_optimized(X_train[:5])
plt.imshow(mat)
plt.colorbar()
plt.title("Kernel matrix on sample data (5 points)")
plt.show()

We compute the full kernel matrix:

In [None]:
gram_matrix = kernel_matrix_training_optimized(X_train)
print(gram_matrix.shape)

Evaluating the full Gram matrix should take about 22 minutes (roughly 2.1 circuits per second, i.e. 0.476 s per circuit). Let's take a look at the result:

In [None]:
plt.imshow(gram_matrix)
plt.colorbar()
plt.title("Kernel matrix")
plt.show()
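
The runtime quoted above can be reproduced with a short back-of-the-envelope calculation. It assumes the default 75/25 split of `train_test_split` on the 100 samples and the per-circuit time stated above. The comparison with the full, non-symmetric matrix is added for illustration.

```python
n_train = 75                  # 75% of the 100 samples (default split)
seconds_per_circuit = 0.476   # rate quoted in the text

# Only the pairs above the diagonal of the Gram matrix are evaluated.
n_pairs = n_train * (n_train - 1) // 2
optimized_minutes = n_pairs * seconds_per_circuit / 60

# Evaluating every entry of the matrix would take about twice as long.
naive_minutes = n_train ** 2 * seconds_per_circuit / 60

print(f"{n_pairs} circuits, about {optimized_minutes:.1f} min "
      f"(vs. {naive_minutes:.1f} min without using symmetry)")
```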

Training the SVM optimizes internal parameters that basically weigh
kernel functions. It is a breeze in scikit-learn, which is designed as a
high-level machine learning library.
Let's compute the accuracy on the test set once training is done.


In [None]:
with dev_kernel.tracker:
    # Training reuses the precomputed Gram matrix, so it triggers no new circuits.
    svm = SVC(kernel=lambda A, B: gram_matrix).fit(X_train, y_train)
    # Prediction needs kernel values between test and training points.
    svm.set_params(kernel=kernel_matrix)
    predictions = svm.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)

print(f"predictions.shape = {predictions.shape}")
print(f"y_test.shape = {y_test.shape}")
print(f"Accuracy: {accuracy}")
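
The "internal parameters that weigh kernel functions" can be made explicit. A trained SVM classifies a new input $x$ by the sign of $\sum_m c_m \kappa(x^m, x) + b,$ where the sum runs over the support vectors $x^m.$ The sketch below uses a classical stand-in for the quantum kernel (the closed form of the overlap for an RX product embedding) and hypothetical support vectors, coefficients, and intercept. None of these values come from the model trained above.

```python
import math


def stand_in_kernel(x, x_prime):
    """Classical stand-in: overlap kernel of an RX product embedding."""
    return math.prod(math.cos((a - b) / 2) ** 2 for a, b in zip(x, x_prime))


def decision_function(x, support_vectors, coefficients, intercept):
    """Weighted sum of kernel evaluations against the support vectors."""
    weighted = sum(c * stand_in_kernel(sv, x)
                   for sv, c in zip(support_vectors, coefficients))
    return weighted + intercept


def predict(x, support_vectors, coefficients, intercept):
    value = decision_function(x, support_vectors, coefficients, intercept)
    return 1.0 if value >= 0 else -1.0


# Hypothetical trained parameters: one support vector per class.
support_vectors = [[0.0, 0.0], [2.0, 2.0]]
coefficients = [1.0, -1.0]
intercept = 0.0

label_a = predict([0.1, 0.2], support_vectors, coefficients, intercept)
label_b = predict([1.9, 2.1], support_vectors, coefficients, intercept)
print(label_a, label_b)
```

Each prediction therefore costs one kernel evaluation per support vector, which is why the quantum device is needed again at test time.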

In [None]:
# Stop the device to end the remote session and avoid unnecessary cost.
dev_kernel.stop()

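The SVM predicted all test points correctly. How many times was the
quantum device evaluated? Note that the tracker only records executions
made inside the `with` block above, which covers the test-time kernel
evaluations. The Gram matrix used for training was computed beforehand.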

In [None]:
dev_kernel.tracker.totals["executions"]
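
As a rough cross-check, we can estimate how many kernel evaluations prediction requires. The sketch assumes that `SVC` evaluates a callable kernel between all test inputs and all stored training inputs, and that the split is the default 75/25. The number reported by the tracker may differ if the device groups broadcast inputs into fewer executions.

```python
n_train, n_test = 75, 25      # default 75/25 split of the 100 samples
seconds_per_circuit = 0.476   # rate quoted earlier in this demo

# Prediction compares every test point with every training point.
prediction_evaluations = n_test * n_train
prediction_minutes = prediction_evaluations * seconds_per_circuit / 60

print(f"{prediction_evaluations} kernel evaluations, "
      f"about {prediction_minutes:.1f} min at the quoted rate")
```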