### Many Regression Algorithms, One Unified Model - A Review
1. Function approximators are often used to capture learned paths, for example, in DMPs
2. Despite their many flavours (LWR, GPR, GMR, etc), they form a special case of the unified model
3. This paper's contributions are:
<br> (1) A wide variety of regression algorithms fall into two main classes: a mixture of linear models or a weighted sum of basis functions
<br> (2) The second class is a special case of the former

### Least squares regression
1. $a^* = \arg\min_a (y-Xa)^T (y-Xa)$. The solution is $a^* = (X^T X)^{-1} X^T y$
2. $a^* = \arg\min_a (\frac{\lambda}{2} ||a||^2 + \frac{1}{2} ||y-Xa||^2)$, known as Tikhonov regularization or ridge regression. The solution is $a^* = (\lambda I + X^T X)^{-1} X^T y$. An $L_1$ norm can be used in the regularization term instead.
3. These are batch learning methods. There are also incremental least squares methods, for example, recursive least squares
4. Due to the inversion of the matrix, the complexity is $O(n^3)$. Although the Sherman-Morrison formula can be used to reduce the cost of updating the inverse to $O(n^2)$, the method is sensitive to rounding errors.
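
The two closed-form solutions above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names, the synthetic data and the value of `lam` are all made up for the example, and `np.linalg.solve` is used instead of forming the inverse explicitly.

```python
import numpy as np

def least_squares(X, y):
    """Solve a* = (X^T X)^{-1} X^T y without forming the inverse explicitly."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge(X, y, lam):
    """Solve a* = (lam * I + X^T X)^{-1} X^T y (Tikhonov / ridge regression)."""
    D = X.shape[1]
    return np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ y)

# Synthetic data (hypothetical): N = 50 samples, D = 3 input dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
a_true = np.array([1.0, -2.0, 0.5])          # made-up ground-truth parameters
y = X @ a_true + 0.01 * rng.normal(size=50)  # targets with a little noise

a_ls = least_squares(X, y)        # close to a_true
a_ridge = ridge(X, y, lam=10.0)   # shrunk towards zero by the regularization term
```

With `lam=0.0` the ridge solution coincides with the plain least squares one; increasing `lam` reduces the norm of the resulting parameter vector.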

### Model parameters vs Meta parameters
1. Algorithms are designed to determine the optimal values of the model parameters, given an optimization criterion. Meta-parameters are algorithmic parameters that the *user has to provide* as input to the algorithm. For example, in $f(x) = a^T x + b$ the model parameters are $a$ and $b$; they are **all** the parameters required to make a prediction for a novel input.
2. For least squares, there are no meta-parameters; for Tikhonov regularization, the user has to tune the parameter $\lambda$. The parameter $\lambda$ is thus a meta-parameter.
3. Another way of looking at it: meta-parameters are chosen by the user independently of the training data, whereas model parameters are computed from the training data. With regularized LLS, the resulting parameter $a$ depends on both the training data and the meta-parameter $\lambda$

### Nonlinear regression using linear methods
1. In general there are two ways to approach nonlinear regression: (1) Algorithms that perform multiple weighted LLS regressions, using different input-dependent weighting functions. The resulting model is a mixture of linear models. Examples include LWR, GMR and LWPR. (2) Algorithms that project the input space into a feature space using a set of nonlinear basis functions, and perform one LLS regression in this projected feature space. Examples include RBFNs and KRR

### Model Type (1)  Mixture of linear models
1. The underlying model is a mixture of linear models, where the *sub-models* are linear and where the *weights* are determined by the normalized weighting functions
2. The model can be written as $f(x) = \sum_{e=1}^E \phi(x,\theta_e) \cdot (a_e^T x + b_e)$

#### LWR
1. Locally weighted regression (LWR) uses the cost function $S(a) = \sum_{n=1}^N w_n (y_n - a^T x_n)^2 = (y-Xa)^T W (y-Xa)$ and the solution to the problem is $a^* = (X^T W X)^{-1} X^T W y$ where $W$ is a diagonal matrix with the weights $w_n$ on its diagonal
2. The weights for each sample are typically defined as a function of the input space through a function $\phi$ parameterized with $\theta$, i.e. $w_n = \phi(x_n, \theta)$ where $\theta$ is a fixed parameter
3. A commonly used weighting function is the multivariate Gaussian:
<br> $\phi(x_n, \theta) = g(x_n, c, \Sigma)$ with $\theta = (c, \Sigma)$
<br> $g(x,c,\Sigma) = \exp(-\frac{1}{2}(x-c)^T \Sigma^{-1} (x-c))$
4. LWR is an extension of the weighted linear least squares, in which $E$ independent weighted regressions are performed on the same data (in the design matrix $X$), but with $E$ independent weight matrices $W_e$:
<br> $a_e = (X^T W_e X)^{-1} X^T W_e y, \forall e = 1...E$
5. The resulting model is $f(x) = \sum_{e=1}^E \phi(x,\theta_e) \cdot (a_e^T x + b_e)$ where the weighting functions are often chosen as normalized Gaussians $\phi(x,\theta_e) = \frac{g(x,c_e,\Sigma_e)}{\sum_{e'=1}^E g(x,c_{e'},\Sigma_{e'})}$ with $\theta_e = (c_e, \Sigma_e)$
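
The LWR steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: all function names are hypothetical, a single shared covariance is used for every Gaussian, the offset $b_e$ is fitted by appending a constant 1 to each input, and the centers and width are picked by hand.

```python
import numpy as np

def gaussian(X, c, Sigma_inv):
    """Unnormalized Gaussian g(x, c, Sigma) evaluated for every row of X."""
    d = X - c
    return np.exp(-0.5 * np.einsum("nd,de,ne->n", d, Sigma_inv, d))

def normalized_weights(X, centers, Sigma_inv):
    """phi(x, theta_e): Gaussian activations normalized over the E sub-models."""
    G = np.stack([gaussian(X, c, Sigma_inv) for c in centers], axis=1)  # (N, E)
    return G / G.sum(axis=1, keepdims=True)

def fit_lwr(X, y, centers, Sigma_inv):
    """One weighted least squares fit per sub-model; returns an (E, D+1) array."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # constant column for the offset b_e
    params = []
    for c in centers:
        w = gaussian(X, c, Sigma_inv)              # diagonal of W_e
        XtW = Xb.T * w                             # X^T W_e without building W_e
        params.append(np.linalg.solve(XtW @ Xb, XtW @ y))
    return np.array(params)

def predict_lwr(X, centers, Sigma_inv, params):
    """f(x) = sum_e phi(x, theta_e) * (a_e^T x + b_e)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    Phi = normalized_weights(X, centers, Sigma_inv)  # (N, E)
    local = Xb @ params.T                            # (N, E) linear sub-model outputs
    return (Phi * local).sum(axis=1)

# Toy 1-D problem (hypothetical): approximate a sine with E = 8 local linear models.
X = np.linspace(0, 2 * np.pi, 200)[:, None]
y = np.sin(X[:, 0])
centers = np.linspace(0, 2 * np.pi, 8)[:, None]
Sigma_inv = np.array([[1.0 / 0.5**2]])               # shared width of 0.5 (assumption)
params = fit_lwr(X, y, centers, Sigma_inv)
y_hat = predict_lwr(X, centers, Sigma_inv, params)
```

Note that the unnormalized Gaussian is used as the sample weight during fitting, while the normalized version blends the sub-models at prediction time, mirroring the two roles of $\phi$ in the list above.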

#### GMR
1. Gaussian mixture regression (GMR) assumes that the data in the joint input $\times$ output ($x$-$y$) space can be represented by a set of Gaussians, which is known as a Gaussian mixture model (GMM)
2. A notable feature of GMR is that the training phase consists of unsupervised learning, performed by fitting a GMM to the data with the expectation-maximization (EM) algorithm. Usually k-means clustering is used to provide a first initialization of the centers.
3. Because EM is an unsupervised learning algorithm, there is no distinction between an input $x_n$ and a target $y_n$. They are concatenated into one vector $z_n = [x_n^T \; y_n^T]^T$. The GMM models the density of the vectors $z_n$ as a weighted sum of $E$ Gaussian functions: $p(z_n) = \sum_{e=1}^E \pi_e \mathcal{N}(z_n; \mu_e, \Sigma_e)$ where $\sum_{e=1}^E \pi_e = 1$
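
To make the density and its use for regression concrete, here is a small NumPy sketch. It assumes the GMM parameters are already known (in practice they would come from EM, for example via scikit-learn's `GaussianMixture`); the parameter values and function names below are invented for illustration. The prediction step uses the standard conditional expectation $E[y|x]$ of a GMM, which these notes do not spell out, so treat it as an added assumption.

```python
import numpy as np

def gauss_pdf(x, mu, Sigma):
    """Multivariate normal density N(x; mu, Sigma) for a single vector x."""
    d = x - mu
    k = len(mu)
    norm = np.sqrt((2 * np.pi) ** k * np.linalg.det(Sigma))
    return np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / norm

def gmm_density(z, pis, mus, Sigmas):
    """p(z) = sum_e pi_e N(z; mu_e, Sigma_e)."""
    return sum(p * gauss_pdf(z, m, S) for p, m, S in zip(pis, mus, Sigmas))

def gmr_predict(x, pis, mus, Sigmas, D):
    """E[y | x]; the first D entries of z are the input, the rest the output."""
    h, local = [], []
    for p, m, S in zip(pis, mus, Sigmas):
        mx, my = m[:D], m[D:]
        Sxx, Syx = S[:D, :D], S[D:, :D]
        h.append(p * gauss_pdf(x, mx, Sxx))                    # unnormalized weight
        local.append(my + Syx @ np.linalg.solve(Sxx, x - mx))  # linear sub-model
    h = np.array(h) / np.sum(h)                                # normalize the weights
    return sum(hi * li for hi, li in zip(h, local))

# Hand-picked (hypothetical) parameters for a 1-D input and a 1-D output.
pis = np.array([0.5, 0.5])
mus = [np.array([0.0, 0.0]), np.array([4.0, 2.0])]
Sigmas = [np.array([[1.0, 0.8], [0.8, 1.0]]),
          np.array([[1.0, -0.5], [-0.5, 1.0]])]

p = gmm_density(np.array([0.0, 0.0]), pis, mus, Sigmas)
y0 = gmr_predict(np.array([0.0]), pis, mus, Sigmas, D=1)
```

Each component contributes a linear function of $x$, weighted by a normalized Gaussian in the input space, which is why GMR fits the mixture-of-linear-models form.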

### Model Type (2) Basis function network
1. Weighted sum of basis functions, i.e. $f(x) = \sum_{e=1}^E \phi(x,\theta_e) \cdot w_e$
2. Least squares for basis function networks results in weights of the form $w^* = (Z^T Z)^{-1} Z^T y$ or $w^* = (\lambda I + Z^T Z)^{-1} Z^T y$. $Z$ can take three forms:
<br> Design matrix
<br> Feature matrix
<br> Gram matrix

#### Design matrix
$Z = X = \begin{bmatrix} x_{1,1} & ... & x_{1,D} \\ ... & ... & ... \\ x_{N,1} & ... & x_{N,D} \end{bmatrix} \in \mathbb{R}^{N \times D}$ (smallest)

#### Feature matrix
$Z = \Phi(X) = \begin{bmatrix} \phi(x_1,\theta_1) & ... & \phi(x_1,\theta_E) \\ ... & ... & ... \\ \phi(x_N,\theta_1) & ... & \phi(x_N,\theta_E) \end{bmatrix} \in \mathbb{R}^{N \times E}$ (medium)

#### Gram matrix (kernel functions)
$Z = K(X,X) = \begin{bmatrix} k(x_1,x_1) & ... & k(x_1,x_N) \\ ... & ... & ... \\ k(x_N,x_1) & ... & k(x_N,x_N) \end{bmatrix} \in \mathbb{R}^{N \times N}$ (largest)

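Because the Gram matrix is square and symmetric, the least squares solution simplifies to $w^* = K^{-1} y$ (assuming $K$ is invertible), and the regularized solution takes the form $w^* = (\lambda I + K)^{-1} y$ (kernel trick)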
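
A minimal sketch of regression with the Gram matrix, using the regularized form $w^* = (\lambda I + K)^{-1} y$. The Gaussian kernel, its `width`, the value of `lam`, the toy data and the function names are all assumptions made for the example.

```python
import numpy as np

def gaussian_kernel(A, B, width=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 width^2)) for all pairs of rows."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-0.5 * sq / width**2)

def fit_krr(X, y, lam, width=1.0):
    """Solve (lam * I + K) w = y, where K is the (N, N) Gram matrix."""
    K = gaussian_kernel(X, X, width)
    return np.linalg.solve(lam * np.eye(len(X)) + K, y)

def predict_krr(X_new, X_train, w, width=1.0):
    """f(x) = sum_n k(x, x_n) w_n: one basis function per training sample."""
    return gaussian_kernel(X_new, X_train, width) @ w

# Toy data (hypothetical): N = 40 samples of a smooth 1-D function.
X = np.linspace(-3, 3, 40)[:, None]
y = np.tanh(X[:, 0])
w = fit_krr(X, y, lam=1e-3, width=0.5)
y_hat = predict_krr(X, X, w, width=0.5)
```

There is one weight per training sample here, which is why the Gram matrix is the largest of the three forms of $Z$.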

#### RBFN
1. A radial basis function network (RBFN) is a special case of the model used by LWR, with $w_e = b_e$ and $a_e = \mathbf{0}$
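
As a sketch of how an RBFN can be fitted with the feature matrix $Z = \Phi(X)$: the snippet below assumes unnormalized Gaussian basis functions with hand-picked centers and a shared width, and a small `lam` to keep the linear system well conditioned. These choices, like the function names, are illustrative assumptions.

```python
import numpy as np

def feature_matrix(X, centers, width):
    """Phi(X): one Gaussian basis function per center, shape (N, E)."""
    sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-0.5 * sq / width**2)

def fit_rbfn(X, y, centers, width, lam=0.0):
    """w* = (lam * I + Z^T Z)^{-1} Z^T y with Z = Phi(X)."""
    Z = feature_matrix(X, centers, width)
    E = Z.shape[1]
    return np.linalg.solve(lam * np.eye(E) + Z.T @ Z, Z.T @ y)

# Toy 1-D problem (hypothetical): N = 100 samples, E = 10 basis functions.
X = np.linspace(0, 2 * np.pi, 100)[:, None]
y = np.sin(X[:, 0])
centers = np.linspace(0, 2 * np.pi, 10)[:, None]
w = fit_rbfn(X, y, centers, width=0.7, lam=1e-6)
y_hat = feature_matrix(X, centers, 0.7) @ w   # f(x) = sum_e phi(x, theta_e) * w_e
```

Compared with the LWR model, each sub-model here is just a constant $w_e$, so only $E$ weights are fitted in a single least squares problem.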

### Examples using scikit-learn
https://jakevdp.github.io/PythonDataScienceHandbook/05.12-gaussian-mixtures.html

### Metaparameters
1. Often, it is easier to specify the configuration of basis functions using meta-parameters. For example, the widths of Gaussian functions can be specified through the height at which they intersect with neighbouring Gaussians. If the centers are equally spaced, this height determines the width of each Gaussian function.
2. Meta-parameters need to be combined with the training data to form the model parameters of the unified model.
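
The intersection-height idea can be worked out for equally spaced centers. Two unnormalized Gaussians with the same width $\sigma$ and centers a distance $d$ apart cross at the midpoint, so $\exp(-\frac{(d/2)^2}{2\sigma^2}) = h$ gives $\sigma = d / \sqrt{-8 \ln h}$. The sketch below assumes 1-D, equally spaced centers and a height $0 < h < 1$; the function name is hypothetical.

```python
import numpy as np

def width_from_intersection(centers, height):
    """Width sigma at which neighbouring unnormalized Gaussians cross at `height`.

    Assumes 1-D, equally spaced centers and 0 < height < 1.
    """
    centers = np.asarray(centers, dtype=float)
    d = centers[1] - centers[0]                    # spacing between neighbours
    # exp(-(d/2)^2 / (2 sigma^2)) = height  =>  sigma = d / sqrt(-8 ln(height))
    return d / np.sqrt(-8.0 * np.log(height))

centers = np.linspace(0.0, 1.0, 5)                 # spacing 0.25
sigma = width_from_intersection(centers, height=0.5)

# Check: evaluate one Gaussian at the midpoint between two neighbouring centers.
midpoint = 0.5 * (centers[0] + centers[1])
value_at_midpoint = np.exp(-0.5 * ((midpoint - centers[0]) / sigma) ** 2)
```

Here the meta-parameters are the number of centers and the intersection height; the centers and the width derived from them are the model parameters that enter $\phi(x, \theta_e)$.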