# L4c: Kernel Functions and Kernel Regression
In this lecture, we will explore kernel functions and their application in kernel regression. Kernel methods allow us to model complex, nonlinear relationships in data by implicitly mapping inputs into high-dimensional feature spaces.

> __Learning Objectives:__
>
> By the end of this lecture, you should be able to:
>
> Three key learning objectives go here

Let's get started!
___

## Example
Today, we will use the following example to illustrate key concepts:
 
> [▶ K-means clustering on a consumer spending dataset](CHEME-5820-L1c-Example-K-Means-Spring-2026.ipynb). In this example, we apply Lloyd's algorithm to customer demographics and spending behavior. We'll observe how K-means partitions customers into distinct segments, visualize the cluster assignments, and examine how centroid placement affects the final groupings.

___

## Kernel Functions
Kernel functions in machine learning are mathematical tools that enable algorithms to operate in high-dimensional spaces without explicitly computing the coordinates in those spaces. _Huh??_

* __What are kernel functions__? A kernel function $k:\mathbb{R}^{\star}\times\mathbb{R}^{\star}\to\mathbb{R}$ takes a pair of vectors $\mathbf{v}_i\in\mathbb{R}^{\star}$ and $\mathbf{v}_j\in\mathbb{R}^{\star}$ as arguments (e.g., a pair of feature vectors, a feature vector and a parameter vector, or any two vectors of compatible size) and computes a scalar value that represents the similarity (in some sense) between the two vector arguments.
* __Common kernel functions__: Common kernel functions include the linear kernel, which is the dot product between two vectors, $k(\mathbf{z}_i, \mathbf{z}_j) = \mathbf{z}_i^{\top}\mathbf{z}_j$; the polynomial kernel of degree $d$, $k_{d}(\mathbf{z}_i, \mathbf{z}_j) = (1+\mathbf{z}_i^{\top}\mathbf{z}_j)^d$; and the radial basis function (RBF) kernel, $k_{\gamma}(\mathbf{z}_i, \mathbf{z}_j) = \exp(-\gamma \lVert\mathbf{z}_i - \mathbf{z}_j\rVert_{2}^2)$, where $\gamma>0$ is a scaling factor and $\lVert\cdot\rVert^{2}_{2}$ is the squared Euclidean norm.
* __What do kernel functions do__? Kernel functions are similarity measures, i.e., they quantify the similarity between pairs of points in _some_ high-dimensional space. For example, we could use a kernel function to measure the similarity between augmented feature vectors $\hat{\mathbf{x}}_{i}$ and $\hat{\mathbf{x}}_{j}$, or we could construct higher-dimensional versions of the features, for example, by including $x^{2}, x^{3}, \dots$ terms in addition to the original features. Increasing the dimension in this way is [sometimes referred to as feature engineering](https://en.wikipedia.org/wiki/Feature_engineering), i.e., transforming raw data into a more effective set of features.
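
As a minimal sketch, the three kernels listed above can be written by hand in a few lines of Julia. The function names `k_linear`, `k_poly`, and `k_rbf`, the default parameter values, and the two test vectors are hypothetical choices made for illustration; they are not the `KernelFunctions.jl` API.

```julia
using LinearAlgebra

# Linear kernel: the ordinary dot product between two vectors.
k_linear(zi::AbstractVector, zj::AbstractVector) = dot(zi, zj)

# Polynomial kernel of degree d: (1 + zi'zj)^d.
k_poly(zi::AbstractVector, zj::AbstractVector; d::Int = 2) = (1 + dot(zi, zj))^d

# RBF kernel with scaling factor γ > 0: exp(-γ * ||zi - zj||^2).
k_rbf(zi::AbstractVector, zj::AbstractVector; γ::Float64 = 1.0) = exp(-γ * sum(abs2, zi - zj))

# Two hypothetical feature vectors, used only for illustration.
z1 = [1.0, 2.0]
z2 = [2.0, 0.0]

s_linear = k_linear(z1, z2)       # 1*2 + 2*0 = 2.0
s_poly = k_poly(z1, z2; d = 2)    # (1 + 2)^2 = 9.0
s_rbf = k_rbf(z1, z2; γ = 0.5)    # exp(-0.5 * 5) = exp(-2.5)
```

Note that the RBF kernel returns exactly 1 when both arguments are the same vector and decays toward 0 as the vectors move apart, which is what makes it read naturally as a similarity score.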

Let's look at a couple of example kernels that are [exported by the `KernelFunctions.jl` package](https://github.com/JuliaGaussianProcesses/KernelFunctions.jl) and apply them to our financial dataset.

> __Experiment__: If two tickers are in similar industries, e.g., `NVDA` and `AMD`, or `GS` and `JPM`, then they should have a high similarity score. However, tickers can also be anti-correlated, e.g., `SPY` or `QQQ` and `GLD`. Anti-correlated tickers should have low similarity scores.

### Can any function be a kernel function?
No! Not all functions can be kernel functions; there are some rules. 
* __Rules for a valid kernel function__: A function $k:\mathbb{R}^{\star}\times\mathbb{R}^{\star}\to\mathbb{R}$ is a _valid kernel function_ if and only if the kernel matrix $\mathbf{K}\in\mathbb{R}^{m\times{m}}$, with elements $K_{ij} = k(\mathbf{v}_i, \mathbf{v}_j)$, is symmetric and positive semidefinite for all possible choices of the data vectors $\mathbf{v}_1,\dots,\mathbf{v}_m$.
Positive semidefinite means that $\mathbf{x}^{\top}\mathbf{K}\mathbf{x} \geq 0$ for any real-valued vector $\mathbf{x}$.
For a symmetric matrix, this is equivalent to saying that all eigenvalues of the kernel matrix $\mathbf{K}$ are real and non-negative. Finally, when the kernel function is an (untransformed) inner product, the kernel matrix is equal to [the Gram matrix](https://en.wikipedia.org/wiki/Gram_matrix).

* __What is the Gram matrix__? For a set of $n$-vectors, $\mathbf{v}_{1},\mathbf{v}_{2},\dots,\mathbf{v}_{m}$, [the Gram matrix](https://en.wikipedia.org/wiki/Gram_matrix) $\mathbf{K}$ is an $m\times{m}$ matrix with elements $K_{ij}=\left<\mathbf{v}_{i},\mathbf{v}_{j}\right>$, where $\left<\star,\star\right>$ denotes an inner product. If the vectors $\mathbf{v}_{1},\mathbf{v}_{2},\dots,\mathbf{v}_{m}$ are the _columns_ of the matrix $\mathbf{X}$, then [the Gram matrix](https://en.wikipedia.org/wiki/Gram_matrix) is given by $\mathbf{K} = \mathbf{X}^{\top}\mathbf{X}$ (assuming all the entries in $\mathbf{X}$ are real). The Gram matrix is symmetric and positive semidefinite.

Let's check these properties using our financial data. Let's compute [the Gram matrix](https://en.wikipedia.org/wiki/Gram_matrix) $\mathbf{K}$, and then compute its eigendecomposition [using the `eigen(...)` method exported by the `LinearAlgebra.jl` package](https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.eigen).
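
Before turning to real data, the same checks can be sketched on a tiny synthetic example. The matrix `X` and the vector `x` below are hypothetical stand-ins (not the financial dataset), and the tolerance `-1e-10` is an assumed allowance for floating-point round-off.

```julia
using LinearAlgebra

# Hypothetical data: three 2-vectors stored as the columns of X.
X = [1.0 0.0 1.0;
     0.0 2.0 1.0]

# Gram matrix with the plain inner product: K[i, j] = <v_i, v_j>.
K = X' * X

# Wrapping K in Symmetric selects the symmetric eigensolver,
# so the eigenvalues are returned as real numbers.
F = eigen(Symmetric(K))
evals = F.values

is_symmetric = issymmetric(K)
# K has rank 2 here, so one eigenvalue is zero up to round-off.
is_psd = all(evals .>= -1e-10)

# Quadratic-form check: x'Kx = ||X * x||^2 >= 0 for any real x.
x = [1.0, -2.0, 0.5]
q = dot(x, K * x)
```

Because three vectors in $\mathbb{R}^{2}$ cannot be linearly independent, this Gram matrix is positive semidefinite but not positive definite: one eigenvalue is (numerically) zero.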

## Kernel regression
Kernel regression is a non-parametric technique to model (potentially) non-linear relationships between variables. It moves away from having one _global_ model to describe data and instead uses a weighted combination of many _local_ (potentially) nonlinear models to describe the data.

> __How does it work__? Kernel regression uses a _kernel function_ to assign weights to new (unseen) data points based on their proximity (similarity) to known data, which allows us to estimate a smooth curve or function that describes the relationship between dependent and independent variables. Thus, we shift to a weighted combination of many _local_ models.

Suppose we have a dataset $\mathcal{D} = \{(\mathbf{x}_{i},y_{i}) \mid i = 1,2,\dots,n\}$, where the features $\mathbf{x}_i \in \mathbb{R}^{m}$ are $m$-dimensional vectors ($m\ll{n}$) and the target variables are continuous values $y_i \in \mathbb{R}$, e.g., the price of a house, the price of a stock, the temperature, etc. We can model this as a linear regression problem:
$$
\hat{\mathbf{y}} = \hat{\mathbf{X}}\theta
$$
where $\hat{\mathbf{X}}$ is a data matrix with the transposes of the augmented feature vectors $\hat{\mathbf{x}}_{i}^{\top}$ on the rows, and $\theta\in\mathbb{R}^{p}$ is an unknown parameter vector, where $p = m+1$. The (regularized) least squares solution for the parameters $\theta$ is given by:
$$
\begin{equation}
\hat{\mathbf{\theta}}_{\lambda} = \left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}+\lambda\,\mathbf{I}\right)^{-1}\hat{\mathbf{X}}^{\top}\mathbf{y}
\end{equation}
$$
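
The regularized least squares solution above can be sketched in Julia as follows. The dataset, the helper name `ridge`, and the choice to append the constant `1` as the last entry of each augmented feature vector are assumptions made for illustration.

```julia
using LinearAlgebra

# Hypothetical dataset: n = 5 observations of a single feature (m = 1).
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]    # generated from y = 1 + 2x, no noise

# Augmented data matrix: each row is [x_i, 1], so p = m + 1 = 2.
Xhat = hcat(x, ones(length(x)))

# Regularized least squares: solve (Xhat'Xhat + λI) θ = Xhat'y
# with the backslash operator instead of forming the inverse explicitly.
ridge(X, y, λ) = (X' * X + λ * I) \ (X' * y)

θ_ols = ridge(Xhat, y, 0.0)      # λ = 0 recovers ordinary least squares
θ_ridge = ridge(Xhat, y, 10.0)   # λ > 0 shrinks the parameter vector
```

With noise-free data and $\lambda = 0$, the estimate recovers the slope and intercept used to generate `y`; increasing $\lambda$ trades some of that fit for a parameter vector with a smaller norm.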

### Kernel ridge regression
The basic idea of kernel regression is to rewrite the parameter vector $\hat{\theta}_{\lambda}$ as a weighted sum of the _augmented feature vectors_: $\hat{\theta}_{\lambda} \equiv \sum_{i=1}^{n}\alpha_{i}\hat{\mathbf{x}}_{i}$. Then, for some (new) feature vector $\hat{\mathbf{z}}$, the predicted output $\hat{y}$ is given by:
$$
\begin{align}
\hat{y} & = \hat{\mathbf{z}}^{\top}\hat{\theta}_{\lambda} = \sum_{i=1}^{n}\alpha_{i}\left<\hat{\mathbf{z}},\hat{\mathbf{x}}_{i}\right>\quad\mid\,\text{Replace inner product with kernel}\\
        & \simeq \sum_{i=1}^{n}\alpha_{i}\,k(\hat{\mathbf{z}},\hat{\mathbf{x}}_{i})
\end{align}
$$
where $k(\hat{\mathbf{z}},\hat{\mathbf{x}}_{i})$ denotes a kernel function (similarity score) between a new (augmented) feature vector $\hat{\mathbf{z}}$ and the (known) training feature vector $\hat{\mathbf{x}}_{i}$. We need to estimate the $\alpha_{i}$ parameters; however, this is not as hard as it may first appear.

__How are $\alpha$ and $\theta$ related__? 
The two expressions for $\hat{\theta}_{\lambda}$ can be equated:
$$
\begin{equation}
\left(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}+\lambda\,\mathbf{I}\right)^{-1}\hat{\mathbf{X}}^{\top}\mathbf{y} = \hat{\mathbf{X}}^{\top}\alpha
\end{equation}
$$
After some algebraic manipulation that [is shown in the course notes](https://github.com/varnerlab/CHEME-5820-Lectures-Spring-2025/blob/main/lectures/week-4/L4a/docs/Notes.pdf), this expression can be solved for the expansion coefficients:
$$
\begin{equation}
\alpha = \left(\mathbf{K}^{\prime}+\lambda\mathbf{I}\right)^{-1}\mathbf{y}
\end{equation}
$$
where $\mathbf{K}^{\prime} = \hat{\mathbf{X}}\hat{\mathbf{X}}^{\top}$, the matrix $\mathbf{I}$ denotes the identity matrix, the vector $\mathbf{y}$ denotes the observed outputs and $\lambda\geq{0}$ denotes the regularization parameter. 
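
The relationship between $\alpha$ and $\hat{\theta}_{\lambda}$ can be checked numerically. The sketch below relies on the matrix identity $(\hat{\mathbf{X}}^{\top}\hat{\mathbf{X}}+\lambda\mathbf{I})^{-1}\hat{\mathbf{X}}^{\top} = \hat{\mathbf{X}}^{\top}(\hat{\mathbf{X}}\hat{\mathbf{X}}^{\top}+\lambda\mathbf{I})^{-1}$, which holds for $\lambda>0$. The dataset, the value of `λ`, the test point `zhat`, and the helper `k_rbf` are all hypothetical choices for illustration.

```julia
using LinearAlgebra

# Hypothetical data: rows of Xhat are augmented feature vectors [x_i, 1].
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
Xhat = hcat(x, ones(length(x)))
n = length(y)
λ = 0.1    # must be > 0 here: Xhat * Xhat' is singular when n > p

# Primal solution: θ = (Xhat'Xhat + λI)^{-1} Xhat'y
θ = (Xhat' * Xhat + λ * I) \ (Xhat' * y)

# Dual solution: α = (K' + λI)^{-1} y with K' = Xhat * Xhat'
K = Xhat * Xhat'
α = (K + λ * I) \ y

# The expansion θ = Xhat'α should reproduce the primal solution.
θ_from_α = Xhat' * α

# Prediction for a new augmented feature vector with the linear kernel.
zhat = [2.5, 1.0]
y_primal = dot(zhat, θ)
y_dual = sum(α[i] * dot(zhat, Xhat[i, :]) for i in 1:n)

# Swapping in a nonlinear kernel (RBF here) only changes how the
# kernel matrix and the prediction are built; the solve is the same.
k_rbf(a, b; γ = 0.5) = exp(-γ * sum(abs2, a - b))
K_rbf = [k_rbf(Xhat[i, :], Xhat[j, :]) for i in 1:n, j in 1:n]
α_rbf = (K_rbf + λ * I) \ y
y_rbf = sum(α_rbf[i] * k_rbf(zhat, Xhat[i, :]) for i in 1:n)
```

With the linear kernel, the primal and dual predictions agree up to round-off. The practical difference is the size of the linear system: the primal solve is $p\times{p}$, while the dual solve is $n\times{n}$, so the dual form pays off mainly when the kernel replaces an inner product in a feature space that would be too large to build explicitly.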

__Hmmm__: That's interesting! In lab tomorrow, we are going to build one of these to model stock prices.

## Summary
One direct, concise summary sentence goes here.

> __Key Takeaways:__
>
> Three key takeaways go here

One direct, concise concluding sentence goes here.
___