# Lecture 3 - Linear Models for Regression with Basis Functions



In our **Polynomial Curve Fitting** example, the function is linear with respect to the parameters we are estimating, so it is considered a *linear regression model*. (It is, however, *non-linear* with respect to the input variable $x$.)


<div class="alert alert-success">
    <b>Input Space</b> 

Suppose we are given a training set comprising $N$ observations of $x$, collected in $\mathbf{x} = \left[x_1, x_2, \ldots, x_N \right]^T$, and the corresponding desired outputs $\mathbf{t} = \left[t_1, t_2, \ldots, t_N\right]^T$, where sample $x_i$ has the desired label $t_i$. The input space is defined by the domain of $\mathbf{x}$.
</div>


* The polynomial curve fitting example can be rewritten as follows:

\begin{eqnarray}
t \sim y(x,\mathbf{w}) &=& w_0 + \sum_{j=1}^{M} w_j x^j\\
&=& \sum_{j=0}^{M} w_j \phi_j(x)\\
\end{eqnarray}
where
$$\phi_j(x) = x^j$$


* By modifying the function $\phi$ (known as a *basis function*), we can easily extend/modify the class of models being considered. 


<div class="alert alert-success">
    <b>Linear Basis Model</b> 

The linear basis model for regression takes linear combinations of fixed nonlinear functions of the input variables
$$t \sim y(\mathbf{x},\mathbf{w}) = \sum_{j=0}^{M} w_j\phi_j(\mathbf{x})$$
where $\mathbf{w} = \left[w_{0}, w_{1}, \ldots, w_{M}\right]^T$ and
$\mathbf{x} = \left[x_1, \ldots, x_D\right]^T$ 
</div>



* For all data observations $\{x_i\}_{i=1}^N$ and using the basis mapping defined as $\boldsymbol{\phi}(x_i) = \left[\begin{array}{ccccc} x_{i}^{0} & x_{i}^{1} & x_{i}^{2} & \cdots & x_{i}^{M}\end{array}\right]^T$, we can write the input data in a *matrix* form as:

$$\mathbf{\Phi} = \left[\begin{array}{ccccc}
1 & x_{1} & x_{1}^{2} & \cdots & x_{1}^{M}\\
1 & x_{2} & x_{2}^{2} & \cdots & x_{2}^{M}\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
1 & x_{N} & x_{N}^{2} & \cdots & x_{N}^{M}
\end{array}\right] = \left[\begin{array}{c}
\boldsymbol{\phi}^T(x_1)\\ \boldsymbol{\phi}^T(x_2) \\ \vdots \\ \boldsymbol{\phi}^T(x_N)\end{array}\right] \in \mathbb{R}^{N\times (M+1)}$$

where each row is a feature representation of a data point $x_i$.
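As a minimal sketch, the matrix above can be built with NumPy's `np.vander`. The helper name `polynomial_design_matrix`, the variable name `Phi`, and the sample inputs are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def polynomial_design_matrix(x, M):
    """Return the N x (M+1) matrix whose i-th row is [1, x_i, x_i^2, ..., x_i^M]."""
    x = np.asarray(x, dtype=float).ravel()
    # increasing=True orders the columns from x^0 up to x^M, matching the text
    return np.vander(x, N=M + 1, increasing=True)

# Hypothetical inputs, chosen only for illustration
x = np.array([0.0, 0.5, 1.0, 2.0])
Phi = polynomial_design_matrix(x, M=3)
print(Phi.shape)
print(Phi)
```

Each row of the result is the feature vector $\boldsymbol{\phi}^T(x_i)$ for one sample, so the matrix has $N$ rows and $M+1$ columns.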

Other **basis functions** include:

* Radial Basis functions (D = 1): $\phi_j(x) = \exp\left\{-\frac{(x-\mu_j)^2}{2s^2}\right\}$ where $x \in \mathbb{R}$

* Radial Basis functions (D > 1): $\phi_j(\mathbf{x}) = \exp\left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_j)^T\boldsymbol{\Sigma}_j^{-1}(\mathbf{x}-\boldsymbol{\mu}_j)\right\}$ where $\mathbf{x} \in \mathbb{R}^D$, $\boldsymbol{\mu}_j \in \mathbb{R}^D$ and $\boldsymbol{\Sigma}_j \in \mathbb{R}^{D\times D}$

* Fourier Basis functions

* Wavelets Basis Functions
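To illustrate swapping the basis, here is a sketch of a design matrix built from the one-dimensional radial basis functions above. The function name `rbf_design_matrix`, the choice of centers $\mu_j$, and the shared width $s$ are assumptions made for the example; a leading column of ones plays the role of $\phi_0(x) = 1$.

```python
import numpy as np

def rbf_design_matrix(x, centers, s):
    """Rows are [1, phi_1(x_i), ..., phi_M(x_i)] with Gaussian RBFs of width s."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)               # shape (N, 1)
    centers = np.asarray(centers, dtype=float).reshape(1, -1)   # shape (1, M)
    # Broadcasting evaluates every basis function at every sample: shape (N, M)
    rbf = np.exp(-((x - centers) ** 2) / (2.0 * s ** 2))
    bias = np.ones((x.shape[0], 1))                             # phi_0(x) = 1
    return np.hstack([bias, rbf])

# Hypothetical samples and centers, for illustration only
x = np.linspace(0.0, 1.0, 5)
centers = np.array([0.0, 0.5, 1.0])
Phi_rbf = rbf_design_matrix(x, centers, s=0.25)
print(Phi_rbf.shape)
```

Only the feature map changes here; the model remains a weighted sum of the columns.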

<div class="alert alert-success">
    <b>Feature Space</b> 

The range of $\boldsymbol{\phi}(\mathbf{x})$ defines the **feature space**:

\begin{align}
\boldsymbol{\phi}: \mathbb{R}^D & \rightarrow \mathbb{R}^{M+1} \\
\mathbf{x} & \mapsto \left[1,\phi_1(\mathbf{x}), \phi_2(\mathbf{x}), \ldots, \phi_M(\mathbf{x})\right]^T
\end{align}
</div>

* When we use linear regression with respect to a set of (non-linear) basis functions, the regression model is linear in the *feature space* but non-linear in the input space.
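The distinction can be checked numerically. The sketch below uses a polynomial basis with made-up weight vectors `w1` and `w2` (assumptions for illustration) and tests the superposition property in $\mathbf{w}$ and in $x$ separately.

```python
import numpy as np

def predict(x, w):
    """Evaluate y(x, w) = sum_j w_j x^j for every entry of x."""
    Phi = np.vander(np.asarray(x, dtype=float), N=len(w), increasing=True)
    return Phi @ w

# Hypothetical inputs and weights
x = np.array([-1.0, 0.0, 1.0, 2.0])
w1 = np.array([1.0, 0.0, 2.0])
w2 = np.array([0.0, -1.0, 0.5])

# Linear in the parameters: y(x, a*w1 + b*w2) == a*y(x, w1) + b*y(x, w2)
lhs = predict(x, 2.0 * w1 + 3.0 * w2)
rhs = 2.0 * predict(x, w1) + 3.0 * predict(x, w2)
linear_in_w = np.allclose(lhs, rhs)

# Not linear in the input: y(2x, w1) differs from 2*y(x, w1)
nonlinear_in_x = not np.allclose(predict(2.0 * x, w1), 2.0 * predict(x, w1))

print(linear_in_w, nonlinear_in_x)
```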


<div class="alert alert-success">
    <b>Objective Function</b> 

We *fit* the polynomial regression model such that the *objective function* $E(\mathbf{w})$ is minimized:
$$\arg\min_{\mathbf{w}} E(\mathbf{w})$$
where $E(\mathbf{w}) = \frac{1}{2}\left\Vert \mathbf{\Phi}\mathbf{w} - \mathbf{t} \right\Vert^2_2$.
</div>
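A short sketch of evaluating this objective for a given weight vector follows. The function name `sum_of_squares_error` and the toy data (targets lying exactly on the line $t = 1 + 2x$) are assumptions chosen so the values are easy to verify by hand.

```python
import numpy as np

def sum_of_squares_error(Phi, w, t):
    """E(w) = 0.5 * ||Phi w - t||_2^2."""
    residual = Phi @ w - t
    return 0.5 * residual @ residual

# Toy data on the line t = 1 + 2x (hypothetical)
x = np.array([0.0, 1.0, 2.0])
t = np.array([1.0, 3.0, 5.0])
Phi = np.vander(x, N=2, increasing=True)   # columns: [1, x]

E_exact = sum_of_squares_error(Phi, np.array([1.0, 2.0]), t)  # perfect fit
E_off = sum_of_squares_error(Phi, np.array([0.0, 2.0]), t)    # every residual is -1
print(E_exact, E_off)
```

With the second weight vector each of the three residuals equals $-1$, so $E(\mathbf{w}) = \frac{1}{2}\cdot 3 = 1.5$.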

<div><img src="figures/LeastSquares.png" width="300"></div>

* This error function sums the squared vertical distances (residuals) between each target $t_i$ and the model's prediction $y(x_i,\mathbf{w})$; minimizing it gives the least-squares fit.

We **optimize** $E(\mathbf{w})$ by finding the *optimal* set of parameters $\mathbf{w}^*$ that minimize the error function. 

To do that, we **take the derivative of $E(\mathbf{w})$ with respect to the parameters $\mathbf{w}$** and set it equal to zero.

$$\frac{\partial E(\mathbf{w})}{\partial \mathbf{w}} = \left[ \frac{\partial E(\mathbf{w})}{\partial w_0},  \frac{\partial E(\mathbf{w})}{\partial w_1}, \ldots,  \frac{\partial E(\mathbf{w})}{\partial w_M} \right]^T$$

* We can rewrite the objective function as:
\begin{align}
E(\mathbf{w}) &= \frac{1}{2} \left( \mathbf{\Phi}\mathbf{w} - \mathbf{t}\right)^T\left( \mathbf{\Phi}\mathbf{w} - \mathbf{t}\right) \\
& = \frac{1}{2} \left( \mathbf{w}^T\mathbf{\Phi}^T - \mathbf{t}^T\right)\left( \mathbf{\Phi}\mathbf{w} - \mathbf{t}\right) \\
& = \frac{1}{2} \left(\mathbf{w}^T\mathbf{\Phi}^T\mathbf{\Phi}\mathbf{w} - \mathbf{w}^T\mathbf{\Phi}^T \mathbf{t} - \mathbf{t}^T\mathbf{\Phi}\mathbf{w} + \mathbf{t}^T\mathbf{t}\right)
\end{align}


* Solving for $\mathbf{w}$, we find:

\begin{align}
\frac{\partial E(\mathbf{w})}{\partial \mathbf{w}} &= 0 \\
\frac{\partial }{\partial \mathbf{w}} \left[\frac{1}{2} \left(\mathbf{w}^T\mathbf{\Phi}^T\mathbf{\Phi}\mathbf{w} - \mathbf{w}^T\mathbf{\Phi}^T \mathbf{t} - \mathbf{t}^T\mathbf{\Phi}\mathbf{w} + \mathbf{t}^T\mathbf{t}\right) \right] &= 0 \\
\frac{\partial }{\partial \mathbf{w}} \left[ \left(\mathbf{w}^T\mathbf{\Phi}^T\mathbf{\Phi}\mathbf{w} - \mathbf{w}^T\mathbf{\Phi}^T \mathbf{t} - \mathbf{t}^T\mathbf{\Phi}\mathbf{w} + \mathbf{t}^T\mathbf{t}\right) \right] &= 0 \\
(\mathbf{\Phi}^T\mathbf{\Phi}\mathbf{w})^T + \mathbf{w}^T\mathbf{\Phi}^T\mathbf{\Phi} - (\mathbf{\Phi}^T \mathbf{t})^T - \mathbf{t}^T\mathbf{\Phi} &=0 \\
\mathbf{w}^T\mathbf{\Phi}^T\mathbf{\Phi} + \mathbf{w}^T\mathbf{\Phi}^T\mathbf{\Phi} - \mathbf{t}^T\mathbf{\Phi} - \mathbf{t}^T\mathbf{\Phi} &= 0\\
2 \mathbf{w}^T\mathbf{\Phi}^T\mathbf{\Phi} &= 2 \mathbf{t}^T\mathbf{\Phi} \\
(\mathbf{w}^T\mathbf{\Phi}^T\mathbf{\Phi})^T &= (\mathbf{t}^T\mathbf{\Phi})^T\text{, apply transpose on both sides} \\
\mathbf{\Phi}^T\mathbf{\Phi}\mathbf{w} &= \mathbf{\Phi}^T\mathbf{t} \\
\mathbf{w} &= \left(\mathbf{\Phi}^T\mathbf{\Phi}\right)^{-1}\mathbf{\Phi}^T\mathbf{t}
\end{align}
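The closed-form solution requires $\mathbf{\Phi}^T\mathbf{\Phi}$ to be invertible, which holds when $\mathbf{\Phi}$ has full column rank. As a hedged sketch, the code below solves the normal equations on synthetic data (a noisy sine curve, the seed, and $M=3$ are all assumptions for illustration) and compares the result with NumPy's least-squares solver. It uses `np.linalg.solve` rather than forming the inverse explicitly, which is generally the more numerically stable choice.

```python
import numpy as np

# Synthetic data (hypothetical): noisy samples of sin(2*pi*x)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
t = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(x.shape)

M = 3
Phi = np.vander(x, N=M + 1, increasing=True)   # N x (M+1) design matrix

# Normal equations: solve (Phi^T Phi) w = Phi^T t
w_normal = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Reference solution from NumPy's least-squares routine
w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# At the minimizer the gradient Phi^T (Phi w - t) should vanish
grad = Phi.T @ (Phi @ w_normal - t)

print(w_normal)
print(np.allclose(w_normal, w_lstsq, atol=1e-6))
```

For larger $M$ the matrix $\mathbf{\Phi}^T\mathbf{\Phi}$ can become ill-conditioned, in which case a solver such as `np.linalg.lstsq` is usually the safer option.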

## Suggested Additional Reading Materials

* From "Python Data Science Handbook" 2017 by Jake VanderPlas, read section "In Depth: Linear Regression", pages 390-405.
    
