# Derivation: Support Vector Regression (SVR)
In the lecture notes, we discussed Support Vector Machines (SVMs) for classification. Here we adapt the same framework to regression, which leads to Support Vector Regression (SVR).

Suppose we have a dataset $\mathcal{D} = \{(\hat{\mathbf{x}}_{i}, y_{i}) \mid i = 1,2,\dots,n\}$, where $\hat{\mathbf{x}}_i \in \mathbb{R}^p$ is an _augmented_ feature vector (with a trailing `1` for the bias) and $y_i \in \mathbb{R}$ is a scalar target.

> __What is the goal of SVR?__
>
> The goal of SVR is to estimate a regression function $f(\hat{\mathbf{x}})=\left<\hat{\mathbf{x}},\theta\right>$ that is _as flat as possible_ while keeping prediction errors within an $\varepsilon$-insensitive tube.

Instead of penalizing every error, SVR penalizes only deviations larger than $\varepsilon$, and it penalizes the excess linearly. Errors inside the tube cost nothing, and the linear growth outside it makes the fit less sensitive to outliers than squared-error regression.

> __Soft Margin SVR Problem__
>
> The $\varepsilon$-insensitive soft margin problem is given by:
> $$
\begin{align*}
    \min_{\theta, \xi, \xi^{*}}\quad & \frac{1}{2}\lVert{\theta}\rVert_{2}^{2} + C\sum_{i=1}^{n}(\xi_{i} + \xi_{i}^{*})\\
    \text{subject to}\quad & y_{i} - \left<\hat{\mathbf{x}}_{i},\theta\right> \leq \varepsilon + \xi_{i}\quad\forall i\\
    & \left<\hat{\mathbf{x}}_{i},\theta\right> - y_{i} \leq \varepsilon + \xi_{i}^{*}\quad\forall i\\
    & \xi_{i} \geq 0,\; \xi_{i}^{*} \geq 0\quad\forall i
\end{align*}
> $$
> where the slack variable $\xi_i$ measures how far the target $y_i$ lies above the $\varepsilon$-tube, $\xi_i^{*}$ measures how far it lies below, and $C>0$ controls the trade-off between flatness and tube violations.

__Values of $\varepsilon$ and $C$__: Larger $\varepsilon$ widens the tube (fewer support vectors, more bias), while smaller $\varepsilon$ tightens the fit (more support vectors, less bias). The parameter $C$ sets the price of tube violations, just as in soft-margin classification: a large $C$ tolerates few violations, while a small $C$ favors a flatter function.

__Do we solve this?__ Yes, but we often use an equivalent unconstrained form with the $\varepsilon$-insensitive loss:
$$
\begin{equation*}
    \min_{\theta}\left[\frac{1}{2}\lVert{\theta}\rVert_{2}^{2} + C\sum_{i=1}^{n}\max\{0, \lvert y_i - \left<\hat{\mathbf{x}}_{i},\theta\right>\rvert - \varepsilon\}\right]
\end{equation*}
$$
which is the direct regression analog of the hinge-loss formulation used for classification.
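
The equivalence of the two forms can be checked numerically. The following is a minimal pure-Python sketch, not part of the original notes: the helper names (`eps_insensitive_loss`, `slacks`, `primal_objective`) and the toy data are assumptions made for illustration. For a fixed $\theta$, the smallest feasible slacks are $\xi_i = \max\{0, y_i - \left<\hat{\mathbf{x}}_{i},\theta\right> - \varepsilon\}$ and $\xi_i^{*} = \max\{0, \left<\hat{\mathbf{x}}_{i},\theta\right> - y_i - \varepsilon\}$, and their sum equals the $\varepsilon$-insensitive loss.

```python
def eps_insensitive_loss(residual, eps):
    """max(0, |r| - eps): zero inside the tube, linear outside it."""
    return max(0.0, abs(residual) - eps)


def slacks(residual, eps):
    """Smallest feasible slack pair (xi, xi_star) for a residual r = y - f(x)."""
    xi = max(0.0, residual - eps)        # target lies above the tube
    xi_star = max(0.0, -residual - eps)  # target lies below the tube
    return xi, xi_star


def primal_objective(theta, X_aug, y, C, eps):
    """0.5 * ||theta||^2 + C * (sum of eps-insensitive losses)."""
    reg = 0.5 * sum(t * t for t in theta)
    loss = 0.0
    for x, target in zip(X_aug, y):
        pred = sum(xj * tj for xj, tj in zip(x, theta))
        loss += eps_insensitive_loss(target - pred, eps)
    return reg + C * loss


# Toy data (made up for illustration); each row ends with the bias feature 1.
X_aug = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
y = [0.1, 0.9, 2.6, 2.0]
theta = [1.0, 0.0]  # candidate regressor f(x) = x

objective = primal_objective(theta, X_aug, y, C=1.0, eps=0.5)
```

Only the last two points fall outside the tube for this choice of `theta`, so only they contribute to the loss term; the first two are ignored entirely.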

___

## Kernelized SVR
If the regression function is nonlinear, we use the kernel trick as before. Let $\phi(\hat{\mathbf{x}})$ denote the feature map associated with a kernel $K(\hat{\mathbf{x}}_{i},\hat{\mathbf{x}}_{j}) = \left<\phi(\hat{\mathbf{x}}_{i}),\phi(\hat{\mathbf{x}}_{j})\right>$.

> __Kernelized SVR (dual)__
>
> The dual for the augmented-input formulation is:
> $$
\begin{align*}
    \max_{\alpha, \alpha^{*}}\quad & -\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}(\alpha_{i}-\alpha_{i}^{*})(\alpha_{j}-\alpha_{j}^{*})K(\hat{\mathbf{x}}_{i},\hat{\mathbf{x}}_{j})\\
    & \quad - \varepsilon\sum_{i=1}^{n}(\alpha_{i}+\alpha_{i}^{*}) + \sum_{i=1}^{n}y_{i}(\alpha_{i}-\alpha_{i}^{*})\\
    \text{subject to}\quad & 0 \leq \alpha_{i} \leq C,\; 0 \leq \alpha_{i}^{*} \leq C\quad\forall i
\end{align*}
> $$
> where $\alpha_i$ and $\alpha_i^{*}$ are Lagrange multipliers (one pair per training example).

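__Note.__ Because we use augmented inputs, the bias term is regularized along with the weights, so there is no separate equality constraint. If the bias $b$ is instead kept explicit (with unaugmented inputs $\mathbf{x}_i$), stationarity with respect to $b$ adds the constraint $\sum_{i=1}^{n}(\alpha_i-\alpha_i^{*})=0$, and the prediction gains a separate bias term: $f(\mathbf{x}) = \sum_{i=1}^{n}(\alpha_i-\alpha_i^{*})K(\mathbf{x}_i,\mathbf{x}) + b$.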
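
__Where the dual comes from (sketch).__ The steps between the soft margin problem and the dual can be outlined as follows. The multipliers $\eta_i, \eta_i^{*} \geq 0$ for the constraints $\xi_i \geq 0,\ \xi_i^{*} \geq 0$ are notation introduced here for illustration; they do not appear elsewhere in these notes. With $\theta$ now living in the feature space of $\phi$, the Lagrangian is
$$
\begin{align*}
    L = \; & \frac{1}{2}\lVert{\theta}\rVert_{2}^{2} + C\sum_{i=1}^{n}(\xi_{i} + \xi_{i}^{*}) - \sum_{i=1}^{n}(\eta_{i}\xi_{i} + \eta_{i}^{*}\xi_{i}^{*})\\
    & - \sum_{i=1}^{n}\alpha_{i}\left[\varepsilon + \xi_{i} - y_{i} + \left<\phi(\hat{\mathbf{x}}_{i}),\theta\right>\right] - \sum_{i=1}^{n}\alpha_{i}^{*}\left[\varepsilon + \xi_{i}^{*} + y_{i} - \left<\phi(\hat{\mathbf{x}}_{i}),\theta\right>\right]
\end{align*}
$$
Setting the derivatives with respect to the primal variables to zero gives
$$
\begin{align*}
    \frac{\partial L}{\partial \theta} = 0 \quad & \Rightarrow\quad \theta = \sum_{i=1}^{n}(\alpha_{i}-\alpha_{i}^{*})\,\phi(\hat{\mathbf{x}}_{i})\\
    \frac{\partial L}{\partial \xi_{i}} = 0 \quad & \Rightarrow\quad C - \alpha_{i} - \eta_{i} = 0\\
    \frac{\partial L}{\partial \xi_{i}^{*}} = 0 \quad & \Rightarrow\quad C - \alpha_{i}^{*} - \eta_{i}^{*} = 0
\end{align*}
$$
Since $\eta_i, \eta_i^{*} \geq 0$, the last two conditions are exactly the box constraints $0 \leq \alpha_i \leq C$ and $0 \leq \alpha_i^{*} \leq C$. Substituting $\theta$ back into $L$ cancels every slack term, turns $\frac{1}{2}\lVert\theta\rVert_2^2 - \left<\theta, \sum_i(\alpha_i-\alpha_i^{*})\phi(\hat{\mathbf{x}}_i)\right>$ into $-\frac{1}{2}\lVert\theta\rVert_2^2$, and leaves the dual objective stated above, with every inner product $\left<\phi(\hat{\mathbf{x}}_{i}),\phi(\hat{\mathbf{x}}_{j})\right>$ replaced by $K(\hat{\mathbf{x}}_{i},\hat{\mathbf{x}}_{j})$.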

__Decision function.__ After solving the dual, the regressor is:
$$
\begin{equation*}
    f(\hat{\mathbf{x}}) = \sum_{i=1}^{n}(\alpha_{i}-\alpha_{i}^{*})K(\hat{\mathbf{x}}_{i},\hat{\mathbf{x}})
\end{equation*}
$$
Only points with $\alpha_{i} > 0$ or $\alpha_{i}^{*} > 0$ appear in the sum; these are the _support vectors_. Points strictly inside the $\varepsilon$-tube have $\alpha_i=\alpha_i^{*}=0$.
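
To make the dual and the resulting regressor concrete, here is a minimal pure-Python sketch that maximizes the box-constrained dual by projected gradient ascent and then predicts with the kernel expansion. This solver is a stand-in chosen for brevity, not the method prescribed by the notes; practical implementations typically use specialized QP or SMO-style solvers. The function names, the toy data, the linear kernel on augmented inputs, and the step-size rule are all assumptions made for illustration.

```python
def linear_kernel(u, v):
    """K(u, v) = <u, v>; any positive semidefinite kernel could be passed instead."""
    return sum(a * b for a, b in zip(u, v))


def fit_svr_dual(X_aug, y, C, eps, kernel=linear_kernel, n_iter=5000):
    """Projected gradient ascent on the dual with box constraints 0 <= alpha <= C."""
    n = len(y)
    K = [[kernel(xi, xj) for xj in X_aug] for xi in X_aug]
    # The gradient is Lipschitz with constant 2*lambda_max(K) <= 2*trace(K),
    # so this step size is a conservative (assumed) choice.
    step = 1.0 / (2.0 * sum(K[i][i] for i in range(n)))
    alpha = [0.0] * n
    alpha_star = [0.0] * n
    for _ in range(n_iter):
        beta = [a - s for a, s in zip(alpha, alpha_star)]
        K_beta = [sum(K[i][j] * beta[j] for j in range(n)) for i in range(n)]
        for i in range(n):
            grad = y[i] - K_beta[i] - eps       # d(dual)/d(alpha_i)
            grad_star = K_beta[i] - y[i] - eps  # d(dual)/d(alpha_i^*)
            # Gradient step followed by projection (clipping) onto [0, C].
            alpha[i] = min(C, max(0.0, alpha[i] + step * grad))
            alpha_star[i] = min(C, max(0.0, alpha_star[i] + step * grad_star))
    return alpha, alpha_star


def dual_objective(alpha, alpha_star, X_aug, y, eps, kernel=linear_kernel):
    """Value of the dual objective at (alpha, alpha_star)."""
    n = len(y)
    beta = [a - s for a, s in zip(alpha, alpha_star)]
    quad = sum(beta[i] * beta[j] * kernel(X_aug[i], X_aug[j])
               for i in range(n) for j in range(n))
    tube = eps * sum(a + s for a, s in zip(alpha, alpha_star))
    fit = sum(yi * bi for yi, bi in zip(y, beta))
    return -0.5 * quad - tube + fit


def predict(x_aug, alpha, alpha_star, X_aug, kernel=linear_kernel):
    """f(x) = sum_i (alpha_i - alpha_i^*) K(x_i, x)."""
    return sum((a - s) * kernel(xi, x_aug)
               for a, s, xi in zip(alpha, alpha_star, X_aug))


# Toy data (made up): y = 2x + 1, inputs augmented with a trailing 1.
X_aug = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
C, eps = 10.0, 0.1

alpha, alpha_star = fit_svr_dual(X_aug, y, C, eps)
# Indices whose multipliers are numerically nonzero act as support vectors.
support = [i for i in range(len(y)) if alpha[i] > 1e-8 or alpha_star[i] > 1e-8]
prediction = predict([2.5, 1.0], alpha, alpha_star, X_aug)
```

Note that `predict` only ever evaluates the kernel between training points and the query, which is why swapping `linear_kernel` for a nonlinear kernel requires no other change.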

> __KKT reminder.__ Complementary slackness gives $\alpha_i\left[y_i - f(\hat{\mathbf{x}}_i) - \varepsilon - \xi_i\right]=0$ and $\alpha_i^{*}\left[f(\hat{\mathbf{x}}_i) - y_i - \varepsilon - \xi_i^{*}\right]=0$, so only points on or outside the tube can be support vectors.
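
__Reading off the cases (sketch).__ The conditions above can be combined with the slackness conditions for the constraints $\xi_i \geq 0$ and $\xi_i^{*} \geq 0$, whose multipliers equal $C-\alpha_i$ and $C-\alpha_i^{*}$. These extra conditions are standard for this quadratic program but are not stated in the notes, so treat the summary below as a hedged aid rather than part of the original derivation:
$$
\begin{align*}
    (C-\alpha_{i})\,\xi_{i} = 0,\qquad & (C-\alpha_{i}^{*})\,\xi_{i}^{*} = 0\\
    \lvert y_{i} - f(\hat{\mathbf{x}}_{i})\rvert < \varepsilon \quad & \Rightarrow\quad \alpha_{i} = \alpha_{i}^{*} = 0\\
    0 < \alpha_{i} < C \quad & \Rightarrow\quad \xi_{i} = 0,\;\; y_{i} - f(\hat{\mathbf{x}}_{i}) = \varepsilon\\
    \alpha_{i} = C \quad & \Rightarrow\quad y_{i} - f(\hat{\mathbf{x}}_{i}) = \varepsilon + \xi_{i} \geq \varepsilon
\end{align*}
$$
The same statements hold for $\alpha_i^{*}$ with the sign of the residual flipped. For $\varepsilon > 0$ a point cannot sit on or beyond both edges of the tube at once, so $\alpha_i\,\alpha_i^{*} = 0$ for every $i$.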

__Do we solve this?__ Yes! The dual is a quadratic program that depends on the inputs only through the kernel matrix $K(\hat{\mathbf{x}}_{i},\hat{\mathbf{x}}_{j})$, so the feature map $\phi$ never has to be computed explicitly.
___