# Kernel regressions

Kernel regression is a non-parametric technique for estimating regression models: it does not assume a particular functional form for the relationship between the variables. Unlike linear regression, which estimates a single constant coefficient for each predictor variable, kernel regression estimates a smooth function of the predictor variables. It does so through the kernel, a function that measures how similar two points are, so that new values are estimated from the values at nearby points. This makes it well suited for data sets where the underlying relationship between the dependent and independent variables is non-linear. The same idea can be applied to curve fitting, prediction, classification, and density estimation. In general, any situation where you need to estimate an unknown function from data can benefit from kernel regression. Several estimators exist; some of them are listed below.

The **Nadaraya-Watson estimator** computes a locally weighted average of the observed values $y_i$, with weights given by the kernel. For a kernel $K_h$ with bandwidth $h$, it is given by
$$
\hat{m}_h(x) = \frac{\sum_{i=1}^n K_h (x - x_i)\, y_i}{\sum_{i=1}^n K_h (x - x_i)}.
$$
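
The expression above can be sketched in a few lines of NumPy. This is only a minimal illustration: the function name `nadaraya_watson`, the toy data, and the choice of a Gaussian shape for $K_h$ are assumptions made here, not part of the definition.

```python
import numpy as np

def nadaraya_watson(x_query, x_data, y_data, h):
    """Nadaraya-Watson estimate at each query point (Gaussian kernel assumed)."""
    x_query = np.atleast_1d(np.asarray(x_query, dtype=float))
    x_data = np.asarray(x_data, dtype=float)
    y_data = np.asarray(y_data, dtype=float)
    # Scaled differences between every query point and every data point,
    # shape (n_query, n_data)
    z = (x_query[:, None] - x_data[None, :]) / h
    # Constant prefactors of the kernel cancel in the ratio, so they are omitted
    k = np.exp(-0.5 * z**2)
    return (k @ y_data) / k.sum(axis=1)

# Hypothetical toy data
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
y_demo = np.array([1.0, 3.0, 2.0, 5.0])
nw_demo = nadaraya_watson([0.5, 1.5], x_demo, y_demo, h=0.5)
print(nw_demo)
```

Because the weights are non-negative and sum to one, every estimate lies between the smallest and the largest observed $y_i$.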

The **Priestley-Chao** estimator, on the other hand, is known for being translation invariant: shifting all data points and the evaluation point by the same constant leaves the estimate unchanged, because it depends only on the differences $x - x_i$ and on the spacings $x_i - x_{i-1}$. It assumes that the $x_i$ are sorted in increasing order. It is calculated with
$$
\hat{m}_\mathrm{PC}(x) = h^{-1} \sum_{i=2}^n (x_i - x_{i-1}) K \left( \frac{x - x_i}{h} \right)\, y_i,
$$
where $h$ is the bandwidth or smoothing parameter.
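
A possible NumPy sketch of this sum is shown next. The function name `priestley_chao`, the Gaussian choice for $K$, and the test signal are assumptions for illustration only; the data are sorted inside the function because the spacings $x_i - x_{i-1}$ only make sense for ordered points.

```python
import numpy as np

def priestley_chao(x_query, x_data, y_data, h):
    """Priestley-Chao estimate at each query point (Gaussian kernel assumed)."""
    x_query = np.atleast_1d(np.asarray(x_query, dtype=float))
    order = np.argsort(x_data)
    xs = np.asarray(x_data, dtype=float)[order]
    ys = np.asarray(y_data, dtype=float)[order]
    # Spacings x_i - x_{i-1} for i = 2..n
    spacing = np.diff(xs)
    # Kernel evaluated at (x - x_i)/h for i = 2..n, shape (n_query, n-1)
    z = (x_query[:, None] - xs[None, 1:]) / h
    k = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return (k * spacing * ys[1:]).sum(axis=1) / h

# Hypothetical example: a densely sampled sine wave
x_grid = np.linspace(0.0, 10.0, 1001)
pc_demo = priestley_chao([5.0], x_grid, np.sin(x_grid), h=0.3)
print(pc_demo)
```

Unlike the Nadaraya-Watson estimator, the weights here are not renormalized, so the estimate is only reliable where the data are dense and the evaluation point is far from the edges of the sampled range.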

Finally, the **Gasser-Müller** estimator weights each observation by the integral of the kernel over an interval around the corresponding data point. It is often considered more efficient than the other estimators and is reported to be more accurate for data sets with a high degree of variance, which makes it a popular choice in machine learning applications. It is calculated with
$$
\hat{m}_\mathrm{GM}(x) = h^{-1} \sum_{i=1}^n \left[ \int_{s_{i-1}}^{s_i} K \left( \frac{x - u}{h} \right)\, du \right]\, y_i,
$$
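where $s_i = (x_i + x_{i+1})/2$ are the midpoints between consecutive sorted data points, so that $x_i$ lies in the interval $[s_{i-1}, s_i]$.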
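
For a Gaussian kernel the integral has a closed form in terms of the standard normal cumulative distribution $\Phi$: $h^{-1}\int_{s_{i-1}}^{s_i} K\left(\frac{x-u}{h}\right) du = \Phi\left(\frac{x - s_{i-1}}{h}\right) - \Phi\left(\frac{x - s_i}{h}\right)$. The sketch below uses this identity. It rests on several assumptions that are not stated in the text: a Gaussian $K$, interval edges placed at the midpoints between neighbouring sorted points so that each interval contains its own $x_i$, and outermost edges set to the first and last data points. The function name `gasser_muller` is hypothetical.

```python
import math
import numpy as np

# Standard normal CDF, applied element-wise
_std_normal_cdf = np.vectorize(lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0))))

def gasser_muller(x_query, x_data, y_data, h):
    """Gasser-Mueller estimate at each query point (Gaussian kernel assumed)."""
    x_query = np.atleast_1d(np.asarray(x_query, dtype=float))
    order = np.argsort(x_data)
    xs = np.asarray(x_data, dtype=float)[order]
    ys = np.asarray(y_data, dtype=float)[order]
    # Interval edges: midpoints between neighbours, closed off at the
    # first and last data point (an assumption of this sketch)
    mid = 0.5 * (xs[:-1] + xs[1:])
    edges = np.concatenate(([xs[0]], mid, [xs[-1]]))
    # Phi((x - s)/h) for every query point and every edge, shape (n_query, n+1)
    cdf = _std_normal_cdf((x_query[:, None] - edges[None, :]) / h)
    # Weight of y_i: Phi((x - s_{i-1})/h) - Phi((x - s_i)/h)
    weights = cdf[:, :-1] - cdf[:, 1:]
    return weights @ ys

# Hypothetical example: constant data should be reproduced away from the edges
x_gm = np.linspace(0.0, 10.0, 21)
gm_demo = gasser_muller([5.0], x_gm, np.ones_like(x_gm), h=0.5)
print(gm_demo)
```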

## Overview

We will explore representative examples of Gaussian kernel regression, first in one dimension and then in its multi-dimensional generalization, and then look at kernel ridge regression.

# Libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt

from scipy.stats             import norm
from scipy.stats             import multivariate_normal

from sklearn.kernel_ridge    import KernelRidge
from sklearn.model_selection import train_test_split
from sklearn.metrics         import mean_squared_error

## 1. Gaussian kernel regression

In Gaussian kernel regression the shape of the kernel is the Gaussian curve:
$$
\frac{1}{\sqrt{2\,\pi}} \exp \left[ - \frac{z^2}{2} \right].
$$

Each constructed kernel describes a normal distribution with mean value ${\bf x}_i$ and standard deviation $b$, where $b$ is a hyperparameter that controls the width of the Gaussian:
$$
k(x, x_i) = \frac{1}{\sqrt{2\,\pi}} \exp \left[ - \frac{(x-x_i)^2}{2\,b^2} \right].
$$

Note that the normalization of the Gaussian does not matter as the weights are being normalized themselves.

The weights for a given new input $\tilde x$ are calculated from the normalized kernel values:
$$
w_i = \frac{k(\tilde x, x_i)}{\sum_{l=1}^N k(\tilde x, x_l)}.
$$

The prediction $\tilde y$ is obtained by multiplying the weight vector ${\bf w} = [w_1, w_2, \dots, w_N]$ with the label vector ${\bf y} = [y_1, y_2, \dots, y_N]$:
$$
\tilde y = \sum_{i=1}^N w_i\, y_i.
$$
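
As a small worked example of these two steps, the snippet below computes the weights and the prediction for three points. The points are the first three of the dataset used in the next section; the bandwidth `b_demo` and the test input `x_new` are arbitrary values chosen for illustration.

```python
import numpy as np

x_pts = np.array([10.0, 20.0, 30.0])
y_pts = np.array([2337.0, 2750.0, 2301.0])
b_demo = 10.0   # kernel width (illustrative value)
x_new = 20.0    # new input

# Unnormalized Gaussian kernel values; the prefactor cancels in the weights
k_vals = np.exp(-0.5 * ((x_new - x_pts) / b_demo) ** 2)

# Normalize so that the weights sum to one
w = k_vals / k_vals.sum()

# The prediction is the weighted average of the labels
y_new = w @ y_pts
print(w, y_new)
```

Since `x_new` coincides with the middle point, that point receives the largest weight, and the two outer points, being equally far away, receive equal weights.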

### 1.1 One-dimensional regression

In [None]:
# Create a 1D dataset
X = np.array([10,20,30,40,50,60,70,80,90,100,110,120])
Y = np.array([2337,2750,2301,2500,1700,2100,1100,1750,1000,1642,2000,1932])

# Plot the dataset
fig,ax=plt.subplots( figsize=(8,8) )

plt.rc('xtick', labelsize=18) 
plt.rc('ytick', labelsize=18)

ax.set_xlabel('x', fontsize=18)
ax.set_ylabel('y', fontsize=18)

ax.set_title('Data',fontsize=20)

ax.scatter(X, Y, color='blue', label='Training')

plt.legend(fontsize=20)

plt.show()

For the sake of simplicity, we will create an object for the implementation of the Gaussian kernel regression model. The object will have the following methods:
    
- **kernel**          : This method will compute the Gaussian kernel for a given set of points.
- **predict**         : This method will compute the prediction for a given set of points.
- **plot_kernels**    : This method will plot the Gaussian kernels.
- **plot_predictions**: This method will plot the kernels together with the position of the point to predict.

In [None]:
# Define a class for Gaussian Kernel Regression
class GaussianKernelRegression1D:
    #
    # Initialization
    #
    def __init__(self, x, y, b):

        self.x = np.asarray(x)
        self.y = np.asarray(y)
        self.b = b
    #
    # Implement the Gaussian Kernel
    #
    def kernel(self, z):

        return ( 1.0/np.sqrt(2.0*np.pi) )*np.exp(-0.5*z*z)
    #
    # Calculate weights and return prediction
    #
    def predict(self, X):

        kernels = [self.kernel( (xi-X)/self.b ) for xi in self.x]
        weights = [kernel/np.sum(kernels) for kernel in kernels]

        return np.dot(weights, self.y)
    #
    # Visualize the kernels
    #
    def plot_kernels(self, points=100):

        plt.figure( figsize = (12,6) )

        plt.title('Kernel', fontsize=20)

        plt.ylabel(r'Kernel Weights $w_i$', fontsize=18)
        plt.xlabel('x', fontsize=18)

        for xi in self.x:
            x_normal = np.linspace(xi - 3.0*self.b, xi + 3.0*self.b, num=points)
            y_normal = norm.pdf(x_normal, xi, self.b)

            plt.plot(x_normal, y_normal, label=r'Kernel at $x_i$={}'.format(xi))
            
        plt.legend(fontsize=14)
    #
    # Visualize the predictions
    #
    def plot_predictions(self, X, points=100):

        max_y = 0

        plt.figure( figsize = (12,6) )

        plt.title('Prediction', fontsize=20)

        plt.ylabel(r'Kernel Weights $w_i$', fontsize=18)
        plt.xlabel('x', fontsize=18)
        
        for xi in self.x:
            x_normal = np.linspace(xi - 3.0*self.b, xi + 3.0*self.b, num=points)
            y_normal = norm.pdf(x_normal, xi, self.b)

            max_y = max(max(y_normal), max_y)

            plt.plot(x_normal, y_normal, label=r'Kernel at $x_i$={}'.format(xi))
            
        plt.plot([X,X], [0, max_y], 'k-', lw=2, dashes=[2, 2])
        
        plt.legend(fontsize=14)

In [None]:
# Set the width of the Gaussian kernel
b = 20

# Create an instance of the GaussianKernelRegression class
gaussian_kernel_regression = GaussianKernelRegression1D(X, Y, b)
gaussian_kernel_regression.plot_kernels(points=100)

# Prediction for test x
x = 26.0
gaussian_kernel_regression.plot_predictions(x, points=200)

In [None]:
# Visualize the 1-dimensional prediction
xlist = np.linspace(0, 120, 240)
ylist = np.array([gaussian_kernel_regression.predict(x) for x in xlist])

fig,ax = plt.subplots( figsize=(8,8) )

ax.set_title('1D Gaussian Kernel',fontsize=20)

ax.set_xlabel('x',fontsize=18)
ax.set_ylabel('y',fontsize=18)

ax.scatter(X, Y, color='b', label='Training')
ax.plot(xlist, ylist, color='k', label='Prediction')

plt.legend(fontsize=16)

plt.show()

### 1.2 N-dimensional regression

For $N$-dimensional inputs, we need to calculate the kernels with the Euclidean distance instead:
$$
k({\bf x}, {\bf x}_i) = \frac{1}{\sqrt{2\,\pi}} \exp \left ( - \frac{\|{\bf x}-{\bf x}_i\|^2}{2\,b^2} \right ).
$$
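
The distance-based kernel can also be evaluated for many query points at once with NumPy broadcasting. The function below is a hedged sketch of that idea; the name `gaussian_kernel_predict` and the toy data are hypothetical and independent of the class defined in this section.

```python
import numpy as np

def gaussian_kernel_predict(x_train, y_train, x_query, b):
    """Gaussian kernel regression for N-dimensional inputs, vectorized."""
    x_train = np.asarray(x_train, dtype=float)               # (n_train, n_dim)
    y_train = np.asarray(y_train, dtype=float)               # (n_train,)
    x_query = np.atleast_2d(np.asarray(x_query, dtype=float))  # (n_query, n_dim)
    # Squared Euclidean distances, shape (n_query, n_train)
    diff = x_query[:, None, :] - x_train[None, :, :]
    sq_dist = np.sum(diff**2, axis=-1)
    # Unnormalized kernel; the prefactor cancels when the weights are normalized
    k = np.exp(-0.5 * sq_dist / b**2)
    return (k @ y_train) / k.sum(axis=1)

# Hypothetical data: the four corners of the unit square
x_square = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y_square = np.array([1.0, 2.0, 3.0, 4.0])

# The centre is equidistant from all corners, so the result is their mean
centre = gaussian_kernel_predict(x_square, y_square, [0.5, 0.5], b=0.5)
print(centre)
```

For large training sets the intermediate `diff` array can become memory hungry, in which case looping over batches of query points is a reasonable compromise.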

Once again, we will create an object for the implementation of the Gaussian kernel regression model. The object will have the following methods:
    
- **gaussian_kernel** : This method will compute the Gaussian kernel for a given set of points.
- **predict**         : This method will compute the prediction for a given set of points.
- **plot_kernels**    : This method will plot the sum of the Gaussian kernels.

In [None]:
class GaussianKernelRegressionND:
    #
    # Initialization
    #
    def __init__(self, x, y, b):

        self.x = np.asarray(x)
        self.y = np.asarray(y)
        self.b = b
    #
    # Implement the Gaussian Kernel
    #
    def gaussian_kernel(self, z):

        return (1.0/np.sqrt(2.0*np.pi))*np.exp(-0.5*z*z)
    #
    # Calculate weights and return prediction
    #
    def predict(self, X):

        kernels = [self.gaussian_kernel( ( np.linalg.norm(xi-X) )/self.b ) for xi in self.x]
        weights = [(kernel/np.sum(kernels)) for kernel in kernels]

        weights = np.asarray(weights)

        return np.dot(weights.T, self.y)
    #
    # Visualize the kernels
    #
    def plot_kernels(self, points=100):
        zsum = np.zeros( (points,points) )

        plt.figure(figsize = (14,8))

        ax = plt.axes(projection = '3d')

        plt.rc('xtick', labelsize=14) 
        plt.rc('ytick', labelsize=14)

        ax.set_ylabel('y', fontsize=14)
        ax.set_xlabel('x', fontsize=14)
        ax.set_zlabel(r'Kernel Weights $w_i$', labelpad=5, fontsize=14)

        for xi in self.x:
            x, y  = np.mgrid[0:points:complex(0.0, points), 0:points:complex(0.0, points)]
            xy    = np.column_stack([x.flat, y.flat])
            z     = multivariate_normal.pdf(xy, mean=xi, cov=self.b**2)  # cov is a variance, b is a width
            z     = z.reshape(x.shape)
            zsum += z
            
        ax.plot_surface(x, y, zsum)

        plt.show()

In [None]:
# Create a reference function to use for machine learning
def reference_function_3d(x=None, y=None, mesh=False):
    
    if mesh: x, y = np.meshgrid(x, y)

    z = ( x * np.exp(-x**2 - y**2) )

    if mesh: return x, y, z
    
    else: return z

# Generate a grid of points for a 3D function
points  = np.linspace(-2, 2, 51)

x, y, z = reference_function_3d(x=points, y=points, mesh=True)

In [None]:
# Create a seed for reproducibility
np.random.seed(23971)
number_of_points = 100

b = 0.25

x_random = np.random.uniform(-2, 2, number_of_points)
y_random = np.random.uniform(-2, 2, number_of_points)

# Plot function using a dense regular mesh
fig, (ax1, ax2, ax3) = plt.subplots( nrows=3, figsize=(10, 24) )

plt.subplots_adjust(hspace=0.2)

plt.rc('xtick', labelsize=18) 
plt.rc('ytick', labelsize=18)

ax1.set_title('Plot for dense mesh of points', fontsize=20)
ax2.set_title(f'Triangulation plot for '
              f'{number_of_points} random points', fontsize=20)
ax3.set_title('Gaussian Kernel Regression on '
              f'{number_of_points} random points', fontsize=20)

ax1.set(xlim=(-2, 2), ylim=(-2, 2))
ax2.set(xlim=(-2, 2), ylim=(-2, 2))
ax3.set(xlim=(-2, 2), ylim=(-2, 2))

# Dense mesh of points
ax1.contour(x, y, z, levels=14, linewidths=0.5, colors='k')

contour_ax1 = ax1.contourf(x, y, z, levels=14, cmap="RdBu_r")

fig.colorbar(contour_ax1, ax=ax1)

ax1.plot(x, y, 'ko', ms=1)

# Triangulation plot for random points
X_train = np.vstack((x_random, y_random)).T
Y_train = reference_function_3d(x_random, y_random)

ax2.tricontour(x_random, y_random, Y_train, levels=14, linewidths=0.5, colors='k')
contour_ax2 = ax2.tricontourf(x_random, y_random, Y_train, levels=14, cmap="RdBu_r")

fig.colorbar(contour_ax2, ax=ax2)

ax2.plot(x_random, y_random, 'ko', ms=5)

# Train Gaussian Kernel Regression on the random points
gaussian_kernel_regression = GaussianKernelRegressionND(X_train, Y_train, b)

x_flat = x.flatten()
y_flat = y.flatten()

z_flat = [gaussian_kernel_regression.predict( [x_val, y_val] )
          for  x_val, y_val in zip(x_flat, y_flat)]

ax3.tricontour(x_flat, y_flat, z_flat, levels=14, linewidths=0.5, colors='k')
contour_ax3 = ax3.tricontourf(x_flat, y_flat, z_flat, levels=14, cmap="RdBu_r")

fig.colorbar(contour_ax3, ax=ax3)
ax3.plot(x_random, y_random, 'ko', ms=5)

plt.show()

> ### Assignment
>
> How is the training affected by different choices of $b$?

In [None]:
# Define a different set of points
X = np.array([[ 11, 15],
              [ 22, 30],
              [ 33, 45],
              [ 44, 60],
              [ 50, 52],
              [ 67, 92],
              [ 78,107],
              [ 89,123],
              [100,137]])

# One target per row of X (the first nine values of the 1D dataset)
Y = np.array([2337,2750,2301,2500,1700,2100,1100,1750,1000])

b = 10

# Create an instance of the GaussianKernelRegression class
gaussian_kernel_regression = GaussianKernelRegressionND(X, Y, b)

plt.rc('xtick', labelsize=12)
plt.rc('ytick', labelsize=12)

gaussian_kernel_regression.plot_kernels(points=100)

> ### Assignment
>
> How are the weights modified for different values of the bandwidth parameter $b$?

## 2. Ridge kernel regression

This method uses the kernel to map the dataset into the kernel space and then performs a regularized linear regression in that space. Therefore, one should always **choose the appropriate kernel for a problem**. Here, we will be using a polynomial kernel, which for two vectors (two points in our one-dimensional example) ${\bf x}_1$ and ${\bf x}_2$ is given by
$$
K({\bf x}_1, {\bf x}_2) = \left[ \gamma ({\bf x}_1^\mathrm{T}\, {\bf x}_2) + c \right]^d,
$$
where $\gamma$ is the kernel coefficient, $c$ is the independent term and $d$ is the degree of the polynomial. In this case, $\gamma$ and $c$ play a minor role, and their default value of 1.0 is adequate, so we will only focus on optimizing the polynomial degree $d$.
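
To make the idea of "linear regression in kernel space" more concrete, the sketch below evaluates the polynomial kernel directly and checks that, for $d = 2$ and $\gamma = c = 1$ in one dimension, it equals an ordinary dot product between the explicit features $(1, \sqrt{2}\,x, x^2)$. The function names and the three sample points are hypothetical and chosen only for illustration.

```python
import numpy as np

def polynomial_kernel_matrix(x1, x2, gamma=1.0, c=1.0, d=2):
    """Polynomial kernel between the rows of x1 and x2 (both 2D arrays)."""
    return (gamma * (x1 @ x2.T) + c) ** d

def feature_map_degree2(x):
    """Explicit features for d=2 and gamma=c=1 with a single input dimension."""
    x = x.ravel()
    return np.column_stack([np.ones_like(x), np.sqrt(2.0) * x, x**2])

# Three sample points as a column vector (n_samples, n_features)
x_col = np.array([-1.0, 0.5, 2.0]).reshape(-1, 1)

K_poly = polynomial_kernel_matrix(x_col, x_col)
Phi = feature_map_degree2(x_col)

# (1 + x x')^2 = 1 + 2 x x' + x^2 x'^2, so both matrices should agree
print(np.allclose(K_poly, Phi @ Phi.T))
```

The kernel gives the same inner products without ever building the feature vectors, which is what keeps higher degrees and higher input dimensions tractable.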

In [None]:
x.shape

In [None]:
# Create a seed for reproducibility
np.random.seed(seed=5)

data_points = 300
test_points = 1001

# Generate a data set for machine learning
x = np.linspace(-2, 2, num=data_points) + \
    np.random.normal(0.0, 0.3, data_points)
y = np.cos(x) - 2*np.sin(x) + 3*np.cos(x*2) + \
    np.random.normal(0.0, 1.0, data_points)

# Create list with points within the range of the dataset
x_pred = np.linspace(np.amin(x), np.amax(x), num=test_points, endpoint=True)
x_pred = np.array(x_pred).reshape(-1, 1)

# Split the dataset into 80% for training and 20% for testing
x = x.reshape( (x.size,1) )
x_train,x_test,y_train,y_test = train_test_split(x, y, train_size=0.8, shuffle=True)

# Plot the training and testing dataset
fig,ax=plt.subplots( figsize=(8,8) )

ax.set_title('Training and testing data',fontsize=20)

ax.set_xlabel('X values',fontsize=18)
ax.set_ylabel('cos(x)-2sin(x)+3cos(2x)',fontsize=18)

ax.scatter(x_train, y_train, color='#005AB5', marker="o", label='Training')
ax.scatter(x_test,  y_test,  color='#DC3220', marker="D", label='Testing')

plt.legend(loc='best', fontsize=18)
plt.show()

In [None]:
# Create a function to plot and compare results
def plot_comparison(degrees=None, regularization=None, reg_degree=4):

    # Create lists to store results
    y_pred = []

    training_rmse = []
    testing_rmse  = []

    training_predictions = []
    testing_predictions  = []

    # Shape of the plot
    rows, cols = 2, 2

    loop_vector = degrees if degrees is not None else regularization

    for choice in loop_vector:
        
        if degrees is not None:
            kernel_ridge_regression = KernelRidge(alpha=1.0,
                kernel='polynomial',
                degree=choice)
            
        if regularization is not None:
            kernel_ridge_regression = KernelRidge(alpha=choice,
                kernel='polynomial',
                degree=reg_degree)
            
        kernel_ridge_regression.fit(x_train, y_train)

        y_pred.append( kernel_ridge_regression.predict(x_pred) )

        pred_y_train = kernel_ridge_regression.predict(x_train)
        pred_y_test  = kernel_ridge_regression.predict(x_test)
        
        # Calculate training and testing errors
        training_predictions.append(pred_y_train)

        mean_error = np.sqrt(mean_squared_error(y_train, pred_y_train))
        training_rmse.append(mean_error)

        testing_predictions.append(pred_y_test)
        mean_error = np.sqrt(mean_squared_error(y_test, pred_y_test))
        testing_rmse.append(mean_error)

    # Plot the results for each polynomial degree
    fig, axs = plt.subplots( rows, cols, figsize=(12,12) )

    for ax in axs.flat:
        ax.set_xlabel('x', fontsize = 18)
        ax.set_ylabel('y', fontsize = 18)

        ax.label_outer()

        ax.scatter(x_train, y_train, color='#005AB5', marker='o', label='Training')
        ax.scatter(x_test,  y_test,  color='#DC3220', marker='D', label='Testing')

    idx = 0
    for row in range(rows):
        for col in range(cols):
            
            if degrees is not None:
                title_string = r'$d$ = {:2d}'.format( degrees[idx] )
                
            if regularization is not None:
                title_string = r'$\alpha$ = {:.0e}'.format( regularization[idx] )

            axs[row,col].set_title(title_string, fontsize=20)

            axs[row,col].plot(x_pred, y_pred[idx], color='black', lw=4)

            axs[row,col].annotate(f'Training RMSE = {training_rmse[idx]:.2f}',
                    xy=(0.2,0.2), xycoords='axes fraction', fontsize=14)
            
            axs[row,col].annotate(f'Testing RMSE = {testing_rmse[idx]:.2f}',
                    xy=(0.2,0.1), xycoords='axes fraction', fontsize=14)

            idx += 1

    plt.show()

In [None]:
plot_comparison( degrees=[1, 2, 3, 4] )

> ### Assignment
>
> What happens to the RMSE in both training and testing as the degree of the polynomial increases?

### 2.1 Regularization parameter

The regularization parameter, $\alpha$, should also be optimized. It controls the conditioning of the problem: larger values of $\alpha$ lead to more “general” models that ignore the peculiarities of the data. This helps the model ignore noise, but it may also make it blind to actual trends in the data.
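
The role of $\alpha$ is easiest to see in the closed-form solution of kernel ridge regression, where the dual coefficients solve $(K + \alpha I)\,{\bf a} = {\bf y}$ and predictions are $K({\bf x}_\mathrm{new}, X)\,{\bf a}$. To our understanding this is the same linear system that scikit-learn's `KernelRidge` solves, but the code below is an independent NumPy sketch with hypothetical function names and toy data, not the library implementation.

```python
import numpy as np

def poly_kernel(x1, x2, d, gamma=1.0, c=1.0):
    """Polynomial kernel between the rows of x1 and x2 (both 2D arrays)."""
    return (gamma * (x1 @ x2.T) + c) ** d

def kernel_ridge_fit(x_fit, y_fit, alpha, d):
    """Solve (K + alpha*I) a = y for the dual coefficients a."""
    K = poly_kernel(x_fit, x_fit, d)
    return np.linalg.solve(K + alpha * np.eye(len(x_fit)), y_fit)

def kernel_ridge_predict(x_fit, dual_coef, x_query, d):
    """Predict with the kernel between query points and training points."""
    return poly_kernel(x_query, x_fit, d) @ dual_coef

# Hypothetical noise-free toy data: y = x^2
x_toy = np.linspace(-2.0, 2.0, 5).reshape(-1, 1)
y_toy = x_toy.ravel() ** 2

for alpha in (1e-6, 1.0, 100.0):
    coef = kernel_ridge_fit(x_toy, y_toy, alpha, d=2)
    fit = kernel_ridge_predict(x_toy, coef, x_toy, d=2)
    rmse = np.sqrt(np.mean((fit - y_toy) ** 2))
    print(f"alpha = {alpha:g}, training RMSE = {rmse:.4f}")
```

Adding $\alpha$ to the diagonal of $K$ improves the conditioning of the linear system, and the training error grows with $\alpha$ because the fit is pulled away from the data.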

If we perform our kernel ridge regression for different $\alpha$ values, we can see its effect, as shown below.

In [None]:
plot_comparison( regularization=[1e-4, 1e0, 1e2, 1e4] )

> ### Assignment
>
> How do the testing RMSEs change as the regularization parameter $\alpha$ increases?