# L4a: Kernel Functions and Kernel Regression
In this lecture, we will explore kernel functions and their application in kernel regression. Kernel methods allow us to model complex, nonlinear relationships in data by implicitly mapping inputs into high-dimensional feature spaces.

> __Learning Objectives:__
>
> By the end of this lecture, you should be able to:
>
> * __Kernel Function Definition:__ Understand what kernel functions are, their role as similarity measures, and the mathematical requirements (symmetry, positive semi-definiteness) for valid kernels.
> * __Gram Matrix and Positive Semi-Definiteness:__ Learn how to construct Gram matrices from data and verify that kernel matrices satisfy positive semi-definiteness constraints.
> * __Kernel Ridge Regression:__ Understand how kernel functions enable non-parametric regression through the "kernel trick" and how to derive the dual formulation with α coefficients.


Let's get started!
___

## Example
Today, we will use the following example to illustrate key concepts:
 
> [▶ Can we estimate the similarity of different firms?](CHEME-5820-L4a-Example-MeasureFirmSimilarityScores-Spring-2026.ipynb). In this example, let's explore how to measure the similarity between different firms based upon the similarity of their daily growth rates over 10-year periods. Does this similarity correlate with other firm metrics, e.g., business sector, market capitalization, etc.?
___

## Kernel Functions
Kernel functions in machine learning are mathematical tools that enable algorithms to operate in high-dimensional spaces without explicitly computing the coordinates in those spaces. _Huh??_

> __What are kernel functions__? 
>
> A kernel function $k:\mathbb{R}^{m}\times\mathbb{R}^{m}\to\mathbb{R}$ takes a pair of vectors $\mathbf{v}_i\in\mathbb{R}^{m}$ and $\mathbf{v}_j\in\mathbb{R}^{m}$ as arguments, 
> e.g., a pair of feature vectors, a feature vector and a parameter vector, or any two vectors of compatible size, and computes a scalar value that represents the similarity (in some sense) between the two vector arguments. There are many different kernel functions, each with its own properties and applications. To be a valid kernel function, $k(\cdot, \cdot)$ must satisfy certain mathematical properties, such as being symmetric and positive semi-definite.


Common kernel functions include the linear kernel, which is the dot product between two vectors, $k(\mathbf{z}_i, \mathbf{z}_j) = \mathbf{z}_i^{\top}\mathbf{z}_j$; the polynomial kernel of degree $d$, $k_{d}(\mathbf{z}_i, \mathbf{z}_j) = (1+\mathbf{z}_i^{\top}\mathbf{z}_j)^d$; and the radial basis function (RBF) kernel, $k_{\gamma}(\mathbf{z}_i, \mathbf{z}_j) = \exp\left(-\gamma \left\|\mathbf{z}_i - \mathbf{z}_j\right\|_{2}^{2}\right)$, where $\gamma>0$ is a scaling factor and $\left\|\cdot\right\|_{2}^{2}$ is the squared Euclidean norm.
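
As a minimal sketch, the three kernels above can be written by hand in a few lines of Julia. The function names (`linear_kernel`, `polynomial_kernel`, `rbf_kernel`) and the test vectors are illustrative assumptions, not part of any package, and only the standard library is used:

```julia
using LinearAlgebra

# Linear kernel: the plain dot product zi'zj.
linear_kernel(zi::AbstractVector, zj::AbstractVector) = dot(zi, zj)

# Polynomial kernel of degree d: (1 + zi'zj)^d.
polynomial_kernel(zi::AbstractVector, zj::AbstractVector; d::Int = 2) = (1 + dot(zi, zj))^d

# RBF kernel: exp(-γ‖zi - zj‖²) with γ > 0.
rbf_kernel(zi::AbstractVector, zj::AbstractVector; γ::Real = 1.0) = exp(-γ * sum(abs2, zi .- zj))

# Two hypothetical vectors, chosen so the arithmetic is easy to follow.
zi = [1.0, 2.0]
zj = [3.0, -1.0]

k_lin  = linear_kernel(zi, zj)               # 1*3 + 2*(-1) = 1
k_poly = polynomial_kernel(zi, zj; d = 2)    # (1 + 1)^2 = 4
k_rbf  = rbf_kernel(zi, zj; γ = 0.5)         # exp(-0.5 * (4 + 9)) = exp(-6.5)
```

Note that the RBF kernel of a vector with itself is always 1, its maximum value, which matches the reading of a kernel as a similarity score.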

__What do kernel functions do__? 

Kernel functions are similarity measures, i.e., they quantify the similarity between pairs of points in _some_ high-dimensional space. For example, we could use a kernel function to measure the similarity between augmented feature vectors $\hat{\mathbf{x}}_{i}$ and $\hat{\mathbf{x}}_{j}$, or we could construct higher dimensional versions of the features, for example, by including $x^{2}, x^{3}, \dots,$ terms in addition to the original features. Increasing the dimension is [sometimes referred to as feature engineering](https://en.wikipedia.org/wiki/Feature_engineering), i.e., transforming raw data into a more effective set of features.

Let's look at a couple of example kernels that are [exported by the `KernelFunctions.jl` package](https://github.com/JuliaGaussianProcesses/KernelFunctions.jl) and apply them to a financial dataset.

> __Example__ 
> 
> If two tickers are in similar industries, e.g., `NVDA` and `AMD`, or `GS` and `JPM`, then they should have a high similarity score. However, tickers can be anti-correlated, e.g., `SPY` or `QQQ` and `GLD`. Anti-correlated tickers will have low similarity scores. Let's see how different kernel functions capture these relationships.
> 
> [▶ Can we estimate the similarity of different firms?](CHEME-5820-L4a-Example-MeasureFirmSimilarityScores-Spring-2026.ipynb). In this example, let's explore how to measure the similarity between different firms based upon the similarity of their daily growth rates over 10-year periods. Does this similarity correlate with other firm metrics, e.g., business sector, market capitalization, etc.?

___

### Can any function be a kernel function?

While kernel functions are powerful similarity measures, not all functions can serve as kernels. There are important mathematical constraints. 

> __Rules for a valid kernel function__: 
> 
> A function $k:\mathbb{R}^{m}\times\mathbb{R}^{m}\to\mathbb{R}$ is a _valid kernel function_ if and only if it is symmetric, $k(\mathbf{v}_i, \mathbf{v}_j) = k(\mathbf{v}_j, \mathbf{v}_i)$, and the kernel matrix $\mathbf{K}\in\mathbb{R}^{n\times{n}}$ is positive semidefinite for every $n$ and all possible choices of $n$ data vectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_n \in \mathbb{R}^{m}$, where the kernel matrix elements are $K_{ij} = k(\mathbf{v}_i, \mathbf{v}_j)$.
> Positive semidefinite means that $\mathbf{c}^{\top}\mathbf{K}\mathbf{c} \geq 0$ for any real-valued vector $\mathbf{c}\in\mathbb{R}^{n}$.
> Because $\mathbf{K}$ is symmetric, this is equivalent to saying that all eigenvalues of the kernel matrix $\mathbf{K}$ are real and non-negative.

__The Gram matrix as a special case of the kernel matrix__: A [Gram matrix](https://en.wikipedia.org/wiki/Gram_matrix) is a specific kernel matrix where the kernel function is the inner product. That is, for a set of $n$ vectors $\mathbf{v}_{1},\mathbf{v}_{2},\dots,\mathbf{v}_{n} \in \mathbb{R}^{m}$, a Gram matrix $\mathbf{K}$ is an $n\times{n}$ matrix with elements:
$$K_{ij} = k(\mathbf{v}_i, \mathbf{v}_j) = \left\langle\mathbf{v}_{i},\mathbf{v}_{j}\right\rangle$$
where $\left\langle\cdot,\cdot\right\rangle$ denotes an inner product (e.g., the standard Euclidean inner product $\mathbf{v}_i^{\top}\mathbf{v}_j$). If the vectors $\mathbf{v}_{1},\mathbf{v}_{2},\dots,\mathbf{v}_{n}$ are the _columns_ of the matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$, then the Gram matrix is computed as $\mathbf{K} = \mathbf{X}^{\top}\mathbf{X}$ (which produces an $n \times n$ matrix). The key insight: **a Gram matrix is always symmetric and positive semidefinite by construction**, which means the linear (inner-product) kernel always satisfies the validity requirements.
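
To see why, here is a short check under the definitions above. The vector $\mathbf{c}\in\mathbb{R}^{n}$ is an arbitrary real vector introduced only for this argument:
$$
\begin{align*}
\mathbf{K}^{\top} & = \left(\mathbf{X}^{\top}\mathbf{X}\right)^{\top} = \mathbf{X}^{\top}\mathbf{X} = \mathbf{K}\\
\mathbf{c}^{\top}\mathbf{K}\mathbf{c} & = \mathbf{c}^{\top}\mathbf{X}^{\top}\mathbf{X}\mathbf{c} = \left\|\mathbf{X}\mathbf{c}\right\|_{2}^{2} \geq 0
\end{align*}
$$
The first line gives symmetry and the second gives positive semidefiniteness, since a squared norm cannot be negative.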

Let's check these properties using our financial data. Let's compute [the Gram matrix](https://en.wikipedia.org/wiki/Gram_matrix) $\mathbf{K}$, and then compute its eigendecomposition [using the `eigen(...)` method exported by the `LinearAlgebra.jl` package](https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.eigen).
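
The check described above might look like the following sketch. A random matrix stands in for the financial data (which is not reproduced here), and the sizes `m`, `n`, the seed, and the tolerance `tol` are assumptions chosen for illustration:

```julia
using LinearAlgebra
using Random

rng = MersenneTwister(1234)   # fixed seed so the sketch is repeatable
m, n = 5, 8                   # hypothetical sizes: m-dimensional vectors, n of them
X = randn(rng, m, n)          # stand-in data; the vectors are the columns of X

K = X' * X                    # n × n Gram matrix, K[i, j] = dot(X[:, i], X[:, j])

# Wrapping K in Symmetric lets eigen use a symmetric solver (real eigenvalues).
F = eigen(Symmetric(K))
λmin = minimum(F.values)

# Check symmetry and positive semidefiniteness up to floating-point round-off.
tol = 1e-10 * max(1.0, maximum(abs, F.values))
is_symmetric = isapprox(K, K')
is_psd = λmin ≥ -tol
```

Because `m < n` in this sketch, the Gram matrix is rank-deficient, so some computed eigenvalues sit at zero up to round-off and may come out as tiny negative numbers. This is why the comparison uses a small tolerance rather than a strict `λmin ≥ 0`.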

___

## Kernel regression

Now that we understand what makes a valid kernel, let's see how kernels enable nonlinear regression through a technique called kernel ridge regression.

Kernel regression is a non-parametric technique to model (potentially) non-linear relationships between variables. It moves away from having one _global_ model to describe data and instead uses a weighted combination of many _local_ (potentially) nonlinear models to describe the data.

> __How does it work__? Kernel regression uses a _kernel function_ to weight the known (training) data points by their proximity (similarity) to a new (unseen) data point, allowing for estimating a smooth curve or function that describes the relationship between dependent and independent variables. Thus, we shift to a weighted combination of many _local_ models.

Suppose we have a dataset $\mathcal{D} = \{(\mathbf{x}_{i},y_{i}) \mid i = 1,2,\dots,n\}$, where the features $\mathbf{x}_i \in \mathbb{R}^{m}$ are $m$-dimensional vectors ($m\ll{n}$) and the target variables are continuous values $y_i \in \mathbb{R}$, e.g., the price of a house, the price of a stock, the temperature, etc. We can model this as a linear regression problem:
$$
\hat{\mathbf{y}} = \hat{\mathbf{X}}\theta
$$
where $\hat{\mathbf{X}}$ is a data matrix with the transpose of the augmented feature vectors $\hat{\mathbf{x}}^{\top}$ on the rows, and $\theta$ is an unknown parameter vector $\theta\in\mathbb{R}^{p}$ where $p = m+1$. The (regularized) least squares solution for the parameters $\theta$ is given by:
$$
\begin{align*}
\hat{\mathbf{\theta}}_{\lambda} = \left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}+\lambda\,\mathbf{I}\right)^{-1}\hat{\mathbf{X}}^{\top}\mathbf{y}
\end{align*}
$$
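
As a hedged sketch, the regularized least squares solution can be computed directly in Julia. The synthetic data, the true parameters `θtrue`, the noise level, and the choice to append the constant `1` as the last entry of each augmented feature vector are all assumptions made for illustration:

```julia
using LinearAlgebra
using Random

rng = MersenneTwister(42)
n, m = 50, 3                                # hypothetical sizes: n samples, m features
Xraw = randn(rng, n, m)                     # rows are feature vectors
Xhat = hcat(Xraw, ones(n))                  # augmented data matrix, n × p with p = m + 1

θtrue = [1.0, -2.0, 0.5, 3.0]               # made-up parameters used to generate y
y = Xhat * θtrue + 0.1 * randn(rng, n)      # noisy synthetic targets

λ = 0.1                                     # regularization parameter
# Solve (X̂ᵀX̂ + λI) θ = X̂ᵀy with a linear solve instead of forming the inverse.
θ = (Xhat' * Xhat + λ * I) \ (Xhat' * y)
```

Using the backslash solve avoids computing the matrix inverse explicitly, which is generally cheaper and more accurate numerically than calling `inv`.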

### Kernel ridge regression
The basic idea of kernel regression is to rewrite the parameter vector $\hat{\theta}_{\lambda}$ as a weighted sum of the _augmented feature vectors_: $\hat{\theta}_{\lambda} \equiv \sum_{i=1}^{n}\alpha_{i}\hat{\mathbf{x}}_{i}$ (in matrix form, $\hat{\theta}_{\lambda} = \hat{\mathbf{X}}^{\top}\alpha$). This representation shifts our perspective: instead of storing a fixed parameter vector, we store a weighted combination of training examples. This is the foundation of the kernel trick.

> __Why this representation matters__: By expressing the parameters as a sum of training examples, we enable predictions that depend only on inner products between data points. This allows us to replace explicit inner products with kernel functions, enabling implicit feature engineering without actually constructing high-dimensional features.

Then for some (new) augmented feature vector $\hat{\mathbf{z}}$, the predicted output $\hat{y}$ is given by:
$$
\begin{align*}
\hat{y} & = \hat{\mathbf{z}}^{\top}\hat{\theta}_{\lambda} = \sum_{i=1}^{n}\alpha_{i}\left\langle\hat{\mathbf{z}},\hat{\mathbf{x}}_{i}\right\rangle\quad\mid\text{replace inner product with kernel}\\
        & = \sum_{i=1}^{n}\alpha_{i}\,k(\hat{\mathbf{z}},\hat{\mathbf{x}}_{i})
\end{align*}
$$

To find the coefficients $\alpha_i$, we use the Gram matrix of the training data. Recall that the kernel matrix $\mathbf{K}^{\prime} = \hat{\mathbf{X}}\hat{\mathbf{X}}^{\top}$ is precisely a Gram matrix formed from the augmented feature vectors. We use $\mathbf{K}^{\prime}$ (rather than $\mathbf{K}$) here because our data matrix $\hat{\mathbf{X}}$ has augmented feature vectors as **rows**, whereas the Gram matrix $\mathbf{K}$ defined earlier used vectors as **columns**. 

Both are Gram matrices—just computed with different data matrix orientations. This Gram matrix is symmetric and positive semi-definite by construction, which will be important for the mathematical properties of the regression solution.

The two expressions for $\hat{\theta}_{\lambda}$ can be equated, where the sum $\sum_{i=1}^{n}\alpha_{i}\hat{\mathbf{x}}_{i}$ is written in matrix form as $\hat{\mathbf{X}}^{\top}\alpha$:
$$
\begin{align*}
\left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}+\lambda\,\mathbf{I}\right)^{-1}\hat{\mathbf{X}}^{\top}\mathbf{y} &= \hat{\mathbf{X}}^{\top}\alpha\\
\hat{\mathbf{X}}\left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}+\lambda\,\mathbf{I}\right)^{-1}\hat{\mathbf{X}}^{\top}\mathbf{y} & = \hat{\mathbf{X}}\hat{\mathbf{X}}^{\top}\alpha\quad\mid\text{multiply both sides on the left by $\hat{\mathbf{X}}$} \\
\left(\hat{\mathbf{X}}\hat{\mathbf{X}}^{\top}+\lambda\,\mathbf{I}\right)^{-1}\hat{\mathbf{X}}\hat{\mathbf{X}}^{\top}\mathbf{y} & = \hat{\mathbf{X}}\hat{\mathbf{X}}^{\top}\alpha\quad\mid\text{push-through (Woodbury) identity}\\
\left(\mathbf{K}^{\prime}+\lambda\,\mathbf{I}\right)^{-1}\mathbf{K}^{\prime}\mathbf{y} & = \mathbf{K}^{\prime}\alpha\quad\mid\text{substitute $\mathbf{K}^{\prime} = \hat{\mathbf{X}}\hat{\mathbf{X}}^{\top}$}\\
\mathbf{K}^{\prime}\mathbf{y} & = \left(\mathbf{K}^{\prime}+\lambda\,\mathbf{I}\right)\mathbf{K}^{\prime}\alpha\quad\mid\text{multiply both sides on the left by $(\mathbf{K}^{\prime}+\lambda \mathbf{I})$}\\
\mathbf{K}^{\prime}\mathbf{y} & = \mathbf{K}^{\prime}\left(\mathbf{K}^{\prime}+\lambda\,\mathbf{I}\right)\alpha\quad\mid\text{$\mathbf{K}^{\prime}$ commutes with $(\mathbf{K}^{\prime}+\lambda \mathbf{I})$}\\
\mathbf{y} & = \left(\mathbf{K}^{\prime}+\lambda\mathbf{I}\right)\alpha\quad\mid\text{sufficient condition for the line above}\\
\alpha &= \left(\mathbf{K}^{\prime}+\lambda\mathbf{I}\right)^{-1}\mathbf{y}
\end{align*}
$$
The third line uses the identity $\hat{\mathbf{X}}(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}+\lambda \mathbf{I})^{-1}\hat{\mathbf{X}}^{\top} = (\hat{\mathbf{X}}\hat{\mathbf{X}}^{\top}+\lambda \mathbf{I})^{-1}\hat{\mathbf{X}}\hat{\mathbf{X}}^{\top}$, a consequence of the Woodbury matrix identity. The second-to-last line is a sufficient condition rather than a cancellation: $\mathbf{K}^{\prime}$ need not be invertible (its rank is at most $p$, and $p\ll{n}$ here), but any $\alpha$ that satisfies $\mathbf{y} = (\mathbf{K}^{\prime}+\lambda\mathbf{I})\alpha$ also satisfies the line above it. We can confirm directly that this choice reproduces the original parameters: by the same identity, $\hat{\mathbf{X}}^{\top}(\mathbf{K}^{\prime}+\lambda\mathbf{I})^{-1}\mathbf{y} = (\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}+\lambda\mathbf{I})^{-1}\hat{\mathbf{X}}^{\top}\mathbf{y} = \hat{\theta}_{\lambda}$.

Here $\mathbf{K}^{\prime} = \hat{\mathbf{X}}\hat{\mathbf{X}}^{\top}$ is the Gram matrix of the augmented features (which is symmetric and positive semi-definite, as guaranteed by the kernel validity properties learned above), the matrix $\mathbf{I}$ denotes the identity matrix, the vector $\mathbf{y}$ denotes the observed outputs and $\lambda\geq{0}$ denotes the regularization parameter (taking $\lambda>0$ guarantees that $\mathbf{K}^{\prime}+\lambda\mathbf{I}$ is invertible). 
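
The result can be checked numerically. The sketch below, on made-up random data (sizes, seed, `λ`, and the new point `z` are all assumptions), computes the parameters both ways and confirms that the dual coefficients $\alpha$ reproduce the primal solution and give the same prediction:

```julia
using LinearAlgebra
using Random

rng = MersenneTwister(7)
n, m = 30, 2                                  # hypothetical sizes
Xhat = hcat(randn(rng, n, m), ones(n))        # augmented feature vectors on the rows
y = randn(rng, n)                             # stand-in targets
λ = 0.5

# Primal solution: a p × p linear system (p = m + 1).
θ = (Xhat' * Xhat + λ * I) \ (Xhat' * y)

# Dual solution: an n × n linear system in the Gram matrix K′ = X̂X̂ᵀ.
K = Xhat * Xhat'
α = (K + λ * I) \ y
θ_from_α = Xhat' * α                          # θ = Σᵢ αᵢ x̂ᵢ

# Prediction for a new augmented feature vector z, computed both ways.
z = [0.3, -1.2, 1.0]
y_primal = dot(z, θ)
y_dual = sum(α[i] * dot(z, Xhat[i, :]) for i in 1:n)
```

Both `θ_from_α` and `θ` should agree up to round-off, and so should the two predictions. Only the dual form carries over to other kernels, because it touches the data through inner products alone.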

> __Technical Note__:
>
> Here's the key insight: In the derivation above, we used the linear kernel Gram matrix $\mathbf{K}^{\prime} = \hat{\mathbf{X}}\hat{\mathbf{X}}^{\top}$, which is simply the special case where $k(\hat{\mathbf{x}}_i, \hat{\mathbf{x}}_j) = \hat{\mathbf{x}}_i^{\top}\hat{\mathbf{x}}_j$. Notice that the final regression solution $\alpha = (\mathbf{K}^{\prime}+\lambda\mathbf{I})^{-1}\mathbf{y}$ depends on the features **only** through kernel evaluations $k(\hat{\mathbf{x}}_i, \hat{\mathbf{x}}_j)$, never on the explicit feature coordinates themselves. 
>
> This means we can substitute **any valid kernel function** (polynomial, RBF, custom kernels) into $\mathbf{K}^{\prime}$ and use the exact same algorithm without any changes. The algorithm structure remains unchanged; we've simply replaced the linear kernel with a different similarity measure.
>
> Additionally, the [Woodbury matrix identity](https://en.wikipedia.org/wiki/Woodbury_matrix_identity) enables the transformation from $(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}+\lambda \mathbf{I})^{-1}$ (a $p\times{p}$ inverse, which scales with the feature dimension) to $(\hat{\mathbf{X}}\hat{\mathbf{X}}^{\top}+\lambda \mathbf{I})^{-1}$ (an $n\times{n}$ inverse, which scales with the sample size), making kernel ridge regression computationally tractable for large feature dimensions.

___

## The Kernel Trick

Notice the substitution that occurred in the derivation above:
$$\hat{y} = \sum_{i=1}^{n}\alpha_{i}\left\langle\hat{\mathbf{z}},\hat{\mathbf{x}}_{i}\right\rangle$$

Then we simply replaced the inner products with a kernel function:
$$\hat{y} = \sum_{i=1}^{n}\alpha_{i}\,k(\hat{\mathbf{z}},\hat{\mathbf{x}}_{i})$$

This is the **kernel trick** in action. A key insight is that we can implicitly work in high-dimensional feature spaces without explicitly constructing the features.

> __The Kernel Trick__
>
> When we express predictions as sums of inner products between data points, we never need the explicit coordinates of $\hat{\mathbf{z}}$ and $\hat{\mathbf{x}}_{i}$ in the (possibly very high-dimensional) feature space. We only need their inner product. If a function $k(\cdot, \cdot)$ computes this inner product in some transformed feature space, i.e., $k(\hat{\mathbf{z}},\hat{\mathbf{x}}_{i}) = \left\langle\phi(\hat{\mathbf{z}}),\phi(\hat{\mathbf{x}}_{i})\right\rangle$ for some feature map $\phi$, then we've effectively performed feature engineering _without constructing the features explicitly_. This is the kernel trick: replace all inner products with kernel evaluations.
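
A small worked case makes this concrete. For two-dimensional inputs, the degree-2 polynomial kernel $(1+\mathbf{u}^{\top}\mathbf{v})^{2}$ equals an ordinary inner product after mapping each input to six explicit features. The feature map `φ` and the vectors `u`, `v` below are illustrative assumptions, not something defined earlier in the lecture:

```julia
using LinearAlgebra

# Degree-2 polynomial kernel, evaluated directly on the 2-dimensional inputs.
poly2(u, v) = (1 + dot(u, v))^2

# Explicit feature map for 2-dimensional inputs whose inner product matches poly2:
# (1 + u'v)^2 = 1 + 2u₁v₁ + 2u₂v₂ + u₁²v₁² + u₂²v₂² + 2u₁u₂v₁v₂
φ(u) = [1.0, sqrt(2) * u[1], sqrt(2) * u[2], u[1]^2, u[2]^2, sqrt(2) * u[1] * u[2]]

u = [0.5, -1.0]
v = [2.0, 3.0]

k_implicit = poly2(u, v)           # works in the original 2-dimensional space
k_explicit = dot(φ(u), φ(v))       # works in the 6-dimensional feature space
```

The two values agree up to round-off, but `poly2` never builds the six features. For higher degrees or higher input dimensions the explicit map grows quickly, while the kernel evaluation still costs one dot product.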

This approach has a critical property: the algorithm structure never changes. We always solve $\alpha = (\mathbf{K}^{\prime}+\lambda\mathbf{I})^{-1}\mathbf{y}$ using a kernel matrix. By choosing different kernels (linear, polynomial, RBF), we automatically adapt to different feature spaces while maintaining the same mathematical framework.
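
To illustrate that only the kernel changes, here is a minimal sketch of kernel ridge regression with an RBF kernel on a toy one-dimensional problem. The helper names (`kernel_matrix`, `krr_fit`, `krr_predict`), the sine-wave data, and the values of `γ` and `λ` are assumptions for illustration. For simplicity the sketch uses the raw (non-augmented) inputs:

```julia
using LinearAlgebra

# RBF kernel, exp(-γ‖u - v‖²).
rbf(u, v; γ = 1.0) = exp(-γ * sum(abs2, u .- v))

# Kernel matrix with entries k(xᵢ, xⱼ); the rows of X are the training inputs.
kernel_matrix(k, X) = [k(X[i, :], X[j, :]) for i in axes(X, 1), j in axes(X, 1)]

# Fit: solve (K + λI) α = y for the dual coefficients α.
function krr_fit(k, X, y; λ = 1e-3)
    K = kernel_matrix(k, X)
    return (K + λ * I) \ y
end

# Predict: ŷ = Σᵢ αᵢ k(z, xᵢ).
krr_predict(k, X, α, z) = sum(α[i] * k(z, X[i, :]) for i in axes(X, 1))

# Toy nonlinear data: noiseless samples of sin(x) on [-3, 3].
x = collect(range(-3.0, 3.0; length = 40))
X = reshape(x, :, 1)
y = sin.(x)

k = (u, v) -> rbf(u, v; γ = 1.0)
α = krr_fit(k, X, y; λ = 1e-3)
y_new = krr_predict(k, X, α, [0.5])     # should land close to sin(0.5)
```

Swapping in a different kernel only means passing a different function `k`, for example `dot` for the linear kernel. `krr_fit` and `krr_predict` stay the same.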
___

## Summary
Kernel functions are similarity measures that enable non-parametric modeling of complex relationships without explicitly constructing high-dimensional feature representations.

> __Key Takeaways:__
>
> * __Valid kernels must be symmetric and positive semi-definite__, ensuring the corresponding Gram matrices satisfy important mathematical properties for all possible data choices.
> * __The kernel trick__ allows us to work implicitly in high-dimensional spaces by replacing inner products $\langle\mathbf{z}_i, \mathbf{z}_j\rangle$ with kernel evaluations $k(\mathbf{z}_i, \mathbf{z}_j)$, so the computation depends on the number of training examples rather than on the dimension of the feature space.
> * __Kernel ridge regression__ reformulates linear regression using a dual representation with coefficients α, shifting from a model-centric view (fitting θ) to a data-centric view (weighted training examples).


Kernel methods form the mathematical foundation for support vector machines (SVMs), Gaussian processes, and other modern machine learning algorithms.

___