### **Kernel Ridge Regression Learning for Large-Scale Data Using ANOVA-Based Matrix-Vector Multiplication**

#### **Problem Setting**

Solving the kernel ridge regression learning task
$$
\hat{\mathbf{\alpha}} = \underset{\mathbf{\alpha} \in \mathbb{R}^N}{\arg \text{min}} \| \mathbf{y} - K \mathbf{\alpha} \|_2^2 + \beta \mathbf{\alpha}^\top  K \mathbf{\alpha},
$$
where $\mathbf{y} = \begin{bmatrix} {y}_{1}, \dots, {y}_{N} \end{bmatrix}^\top$ is the target vector, $K = \left( \kappa_{ij} \right)_{i,j=1}^N$ is the kernel matrix and $\beta$ is the regularization parameter, is equivalent to solving the linear system
$$
\left( K + \beta I \right) \hat{\mathbf{\alpha}} = \mathbf{y}.
$$
For $\beta > 0$, this system is symmetric and positive definite and can be solved with the conjugate gradient (CG) method. Multiplying with the kernel matrix $K$,
$$
K \mathbf{\alpha} = \left[\sum_{j=1}^N \alpha_j\kappa\left(\mathbf{x}_i,\mathbf{x}_j\right)\right]_{i=1}^N \in\mathbb{R}^N,
$$
is the most expensive step, since one such matrix-vector product is required in every CG iteration and costs $\mathcal{O}(N^2)$ operations when $K$ is dense.
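To make the role of the matrix-vector product concrete, the following is a minimal sketch of a textbook CG loop that only accesses the system matrix through a `matvec` callback. The helper names `gaussian_kernel` and `conjugate_gradient`, the toy problem size and the dense kernel matrix are assumptions for illustration and are not taken from the `NFFTKernelRidge` implementation.

```python
import numpy as np


def gaussian_kernel(X, sigma):
    """Dense Gaussian kernel matrix with entries exp(-||x_i - x_j||^2 / sigma^2)."""
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(sq_dists, 0.0) / sigma**2)


def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A, given only v -> A v."""
    x = np.zeros_like(b)
    r = b.copy()  # residual b - A x for the start vector x = 0
    p = r.copy()  # first search direction
    rs_old = r @ r
    for iteration in range(1, max_iter + 1):
        Ap = matvec(p)  # the single matrix-vector product of this iteration
        step = rs_old / (p @ Ap)
        x += step * p
        r -= step * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * np.linalg.norm(b):
            return x, iteration
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x, max_iter


# Toy problem (hypothetical sizes, small enough for a dense kernel matrix)
rng = np.random.RandomState(0)
X_small = rng.randn(200, 3)
y_small = np.sign(rng.randn(200))
beta, sigma = 1.0, 1.0

K = gaussian_kernel(X_small, sigma)
# (K + beta * I) v is evaluated without ever forming K + beta * I explicitly
alpha_hat, n_iter = conjugate_gradient(lambda v: K @ v + beta * v, y_small)

residual = np.linalg.norm((K + beta * np.eye(200)) @ alpha_hat - y_small)
print("CG iterations:", n_iter, "residual norm:", residual)
```

Because the solver never touches the entries of $K$ directly, the exact product `K @ v` in the callback can be swapped for any sufficiently accurate approximation without changing the rest of the loop.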

We propose to approximate the product $K \mathbf{\alpha}$ instead of computing it exactly. For this, we make use of the *extended Gaussian ANOVA kernel*
$$
K = \left( \kappa_{ij} \right)_{i,j = 1}^{N} \in \mathbb{R}^{N \times N}, \quad \kappa_{ij} = \frac{1}{P} \sum_{l=1}^P \exp \left( - \frac{ \| \mathbf{x}_{i}^{\mathcal{W}_l} - \mathbf{x}_{j}^{\mathcal{W}_l} \|_2^2}{\sigma^2} \right),
$$
where $\sigma$ is a shape parameter, $P$ is the number of kernels to combine, and $\mathcal{W}_l = \{ w_1^l, w_2^l, w_3^l \} \subseteq \{ 1, \dots, d \}$ are the considered feature index sets, so that $\mathbf{x}_i^{\mathcal{W}_l}$ and $\mathbf{x}_j^{\mathcal{W}_l}$ are the data points restricted to the corresponding features. The result is an equally weighted sum of $P$ kernels, each of which relies on at most 3 features. Thus, we can apply the NFFT-based fast summation approach and use the [`fastadj`](https://github.com/dominikalfke/FastAdjacency) package by Dominik Alfke to speed up the kernel-vector multiplication.
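As an illustration of the kernel definition above, the following sketch evaluates the product $K \mathbf{\alpha}$ exactly, one index set at a time, using dense NumPy arithmetic. It is a reference computation only, not the NFFT-based approximation. The helper names, the toy sizes and the choice of index sets (consecutive, non-overlapping triples, written 0-based) are assumptions made for this example.

```python
import numpy as np


def gaussian_kernel(X, sigma):
    """Dense Gaussian kernel matrix with entries exp(-||x_i - x_j||^2 / sigma^2)."""
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(sq_dists, 0.0) / sigma**2)


def anova_matvec(X, windows, sigma, alpha):
    """Exact K @ alpha for the extended Gaussian ANOVA kernel, window by window."""
    result = np.zeros(X.shape[0])
    for window in windows:
        # Sub-kernel on at most 3 features. This dense product is the step
        # that an NFFT-based fast summation would approximate.
        result += gaussian_kernel(X[:, window], sigma) @ alpha
    return result / len(windows)  # equal weights 1/P


rng = np.random.RandomState(0)
N, d = 300, 15
X_demo = rng.randn(N, d)
alpha = rng.randn(N)

# Hypothetical index sets: consecutive, non-overlapping triples (0-based)
windows = [list(range(start, start + 3)) for start in range(0, d, 3)]

product = anova_matvec(X_demo, windows, sigma=1.0, alpha=alpha)
print("P =", len(windows), "product shape:", product.shape)
```

Each sub-kernel only sees three columns of the data, which is what keeps the fast summation in a low-dimensional setting regardless of the total number of features $d$.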

First, we generate the data set we want to work with.

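```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: N points with d standard-normal features and random +/-1 labels
N, d = 20000, 15
rng = np.random.RandomState(0)
X = rng.randn(N, d)
y = np.sign(rng.randn(N))

# Use one half of the data for training and the other half for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, random_state=42)
```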

#### **Compare NFFTKernelRidge Classifier with sklearn KRR and sklearn SVC**

We set up NFFT-based kernel ridge regression with the `NFFTKernelRidge` class. The second class, `GridSearch`, runs a grid search for the `NFFTKernelRidge` classifier as well as for the sklearn classifiers KRR and SVC.

For this, we must define a `param_grid` with the candidate parameter values to be tried during the grid search. In the following example, the grids for all classifiers are equivalent.
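The equivalence of the grids can be made explicit by deriving all three from one list of $\sigma$ values and one list of $\beta$ values. The sketch below assumes the correspondences used in this example: sklearn's RBF kernel $\exp(-\gamma \| \mathbf{x} - \mathbf{y} \|_2^2)$ matches the Gaussian kernel for $\gamma = 1/\sigma^2$, `alpha` plays the role of $\beta$, and `C` is taken as $1/\alpha$. The variable names are chosen for illustration only.

```python
import math

sigmas = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
betas = [1, 10, 100, 1000]

# One grid per classifier, all derived from the same sigma and beta values
nfft_grid = {"sigma": sigmas, "beta": betas}
krr_grid = {"alpha": betas, "gamma": [1 / sigma**2 for sigma in sigmas]}
svc_grid = {"C": [1 / beta for beta in betas], "gamma": [1 / sigma**2 for sigma in sigmas]}

# Sanity check: both parametrizations give the same kernel value
# for an arbitrary squared distance r2
r2 = 0.5
for sigma, gamma in zip(sigmas, krr_grid["gamma"]):
    assert math.isclose(math.exp(-r2 / sigma**2), math.exp(-gamma * r2))

print(krr_grid["gamma"])
print(svc_grid["C"])
```

Generating the grids this way avoids typing the same values three times and keeps the parameter combinations aligned across classifiers.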

```python
# Import the classes from the extracted files
from nfft_kernel_ridge import NFFTKernelRidge, GridSearch


def report(name, results):
    """Print the entries of the results returned by GridSearch.tune."""
    print(f"\nGridSearch for {name}")
    print("Best Parameters:", results[0])
    print("Best Result:", results[1])
    print("Best Runtime Fit:", results[2])
    print("Best Runtime Predict:", results[3])
    print("Mean Runtime Fit:", results[4])
    print("Mean Runtime Predict:", results[5])
    print("Mean Total Runtime:", results[6])


# GridSearch with NFFTKernelRidge
param_grid = {
    "sigma": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    "beta": [1, 10, 100, 1000],
}

model = GridSearch(classifier="NFFTKernelRidge", param_grid=param_grid, balance=False, norm=None)
results = model.tune(X_train, y_train, X_test, y_test)
report("NFFTKernelRidge", results)


# GridSearch with sklearn KRR (alpha = beta, gamma = 1/sigma**2)
param_grid = {
    "alpha": [1, 10, 100, 1000],
    "gamma": [1/((0.001)**2), 1/((0.01)**2), 1/((0.1)**2), 1/((1)**2), 1/((10)**2), 1/((100)**2), 1/((1000)**2)],
}

model = GridSearch(classifier="sklearn KRR", param_grid=param_grid)
results = model.tune(X_train, y_train, X_test, y_test)
report("sklearn KRR", results)


# GridSearch with sklearn SVC (C = 1/alpha, gamma = 1/sigma**2)
param_grid = {
    "C": [1/1, 1/10, 1/100, 1/1000],
    "gamma": [1/((0.001)**2), 1/((0.01)**2), 1/((0.1)**2), 1/((1)**2), 1/((10)**2), 1/((100)**2), 1/((1000)**2)],
}

model = GridSearch(classifier="sklearn SVC", param_grid=param_grid)
results = model.tune(X_train, y_train, X_test, y_test)
report("sklearn SVC", results)
```


```text
GridSearch for NFFTKernelRidge
Best Parameters: (0.001, 1000)
Best Result: (0.5052, 0.5013201901073755, 0.5738464638323595)
Best Runtime Fit: 1.2237987518310547
Best Runtime Predict: 0.18867278099060059
Mean Runtime Fit: 1.8293311425617762
Mean Runtime Predict: 0.20490669352667673
Mean Total Runtime: 2.034237836088453

GridSearch for sklearn KRR
Best Parameters: (100, 0.01)
Best Result: (0.5019, 0.49855815443768026, 0.6270400967156962)
Best Runtime Fit: 4.055310964584351
Best Runtime Predict: 1.1931085586547852
Mean Runtime Fit: 4.183920519692557
Mean Runtime Predict: 1.4633090921810694
Mean Total Runtime: 5.647229611873627

GridSearch for sklearn SVC
Best Parameters: (1.0, 0.01)
Best Result: (0.4998, 0.4971702220287331, 0.690308281281483)
Best Runtime Fit: 2.1764395236968994
Best Runtime Predict: 1.602280616760254
Mean Runtime Fit: 3.8639715228761946
Mean Runtime Predict: 2.420388732637678
Mean Total Runtime: 6.284360255513873
```
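
To put the reported runtimes in relation to each other, the mean total runtimes can be divided by the `NFFTKernelRidge` value. The short calculation below simply copies the three numbers from the output of this run; other data sets, parameter grids or machines will give different ratios.

```python
# Mean total runtimes copied from the output of the run shown in this example
mean_total_runtime = {
    "NFFTKernelRidge": 2.034237836088453,
    "sklearn KRR": 5.647229611873627,
    "sklearn SVC": 6.284360255513873,
}

baseline = mean_total_runtime["NFFTKernelRidge"]
ratios = {name: runtime / baseline for name, runtime in mean_total_runtime.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the NFFTKernelRidge mean total runtime")
```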
