# Introduction

Early morning and having ideas 🙄  This is based on my conversation with Elizabeth Newman and my recent discussion with Mia. 

Can one put convex glasses on a neural network? 😎

# Idea

Suppose I have a non-linear function $f(x)$ that approximates some output $y$ given some input $x$.  Then the error function is given by

$$
E = \|f(X) - Y\|^2_F
$$

where $\|\cdot\|_F$ is the Frobenius norm.  We have also abused notation, but in a standard way, by letting $X$ be the matrix of inputs and $Y$ be the matrix of outputs.  Each row $x_i$ of $X$ and row $y_i$ of $Y$ is a training example and $f$ is applied row-wise.

The idea is to add a linear function $G$ that post-processes the output of $f$, such that the error function becomes

$$
E_G = \sum_{i=1}^k \|G \cdot f(x_i) - y_i\|^2_F,
$$

where $k$ is the number of training examples.

Note, if $f$ is a *fixed* function then solving for $G$ is a convex problem.  I.e., for each training example $(x_i, y_i)$, we can replace $f(x_i)$ with $\hat{y}_i$
to get 

$$
E_G = \sum_{i=1}^k \|G \cdot \hat{y}_i - y_i\|^2_F. 
$$

In fact, $G$ can be solved for in closed form by using the normal equations!
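To spell that out, here is a short worked version of the normal equations. The notation is my own assumption for this sketch: $\hat{Y}$ is the matrix whose $i$-th row is $\hat{y}_i^\top$ (matching the row convention used for $X$ and $Y$), and I assume $\hat{Y}$ has full column rank so that the inverse exists.

$$
E_G = \|\hat{Y} G^\top - Y\|^2_F
\quad\Longrightarrow\quad
\hat{Y}^\top \hat{Y} \, G^\top = \hat{Y}^\top Y
\quad\Longrightarrow\quad
G^\top = (\hat{Y}^\top \hat{Y})^{-1} \hat{Y}^\top Y.
$$

If $\hat{Y}$ is rank deficient, the pseudo-inverse gives a minimum-norm solution instead.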

Solving for $G$ is "free" (since it arises from a convex problem that can be solved in closed form) and we can use it to improve the performance of $f$, at least on the training data.  I.e., choosing $G = I$ recovers the original error $E$, so the optimal $G$ can only do as well or better:

$$
E_G \leq E
$$
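As a sanity check, here is a minimal NumPy sketch of the trick on synthetic data. Everything in it is an assumption made for illustration: the data, the stand-in "fixed" model `f`, and the helper names `fit_linear_head` and `squared_error` are hypothetical and not from any library. It uses `np.linalg.lstsq` rather than forming the normal equations explicitly, which solves the same least-squares problem.

```python
import numpy as np

def fit_linear_head(F, Y):
    """Return G minimizing sum_i ||G f(x_i) - y_i||^2.

    F holds f(x_i) as rows, Y holds y_i as rows, so in row form
    the problem is F @ G.T ~ Y and we solve for W = G.T.
    """
    W, *_ = np.linalg.lstsq(F, Y, rcond=None)
    return W.T

def squared_error(P, Y):
    """Squared Frobenius norm of the residual P - Y."""
    return float(np.sum((P - Y) ** 2))

# Synthetic training data (an assumption for this sketch).
rng = np.random.default_rng(0)
k, d_in, d_out = 200, 5, 3
X = rng.normal(size=(k, d_in))
A_true = rng.normal(size=(d_in, d_out))
Y = np.sin(X @ A_true)

# A hypothetical fixed, imperfect model f standing in for a trained network.
A_model = A_true + 0.3 * rng.normal(size=(d_in, d_out))

def f(X):
    return np.tanh(X @ A_model)

F = f(X)                          # f applied row-wise
G = fit_linear_head(F, Y)         # closed-form linear post-processing
E = squared_error(F, Y)           # error of f alone
E_G = squared_error(F @ G.T, Y)   # error after applying G

print(f"E = {E:.4f}, E_G = {E_G:.4f}")
```

Note that this only checks the training error; it says nothing about held-out data.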

So, why don't we always do this?  I mean, given any method on any leaderboard that does not use this trick, adding it can never make the training error worse, and will usually make it better.  

Well, one reason might be that solving for $G$ leads to overfitting.  But this seems a little surprising, since $G$ is a linear function.  So, it seems like it would be hard to overfit with a linear function, especially when the size of $G$ depends only on the *output* dimension of $f$.  In particular, it is **not a function of the number of unknowns** in $f$.

In fact, this might be why this is not commonly done. For example, when $y_i$ is a scalar, then $G$ is a scalar.  So, we are just rescaling the output of $f$ by a constant.  This is not very interesting.  But, when $y_i$ is a vector, then $G$ is a matrix.  So, we are applying a linear transformation to the output of $f$.  This is more interesting, but there is a bigger idea here I think...🤔

# Bigger Idea

Suppose we have an $f$ which is of a special form. For example, suppose $f$ is a neural network.  Then, we can think of $f$ as a composition of functions

$$
f = f_n \circ f_{n-1} \circ \cdots \circ f_1
$$

where each $f_i$ is a layer of the neural network, or really any set of non-linear functions.

Rewriting the equation for $E_G$ in terms of the composition of functions, we get

$$
E_{G_n} = \sum_{i=1}^k \|G_n \cdot f(x_i) - y_i\|^2_F = \sum_{i=1}^k \|G_n \cdot f_n \circ f_{n-1} \circ \cdots \circ f_1(x_i) - y_i\|^2_F.
$$

Ok, now for a little bit of magic 🪄.  Just for the sake of argument, consider the following

$$
E_{G_{n-1},G_{n}} = \sum_{i=1}^k \|G_{n-1} \cdot f_{n-1} \circ \cdots \circ f_1(x_i) + G_n \cdot f_n \circ f_{n-1} \circ \cdots \circ f_1(x_i) - y_i\|^2_F.
$$

In the neural network parlance this is like adding a skip connection between the output of $f_{n-1}$ and the output of $f_n$.  But, we are not adding a standard skip connection, we are instead adding a linear transformation to the output of $f_{n-1}$.  So, it is still a convex problem that can be solved in closed form!  I.e.,
it can be rewritten as

$$
E_{G_{n-1},G_{n}} = \sum_{i=1}^k \left\| \begin{bmatrix} G_{n-1} & G_n \end{bmatrix} \cdot \begin{bmatrix} f_{n-1} \circ \cdots \circ f_1(x_i) \\ f_n \circ f_{n-1} \circ \cdots \circ f_1(x_i) \end{bmatrix} - y_i \right\|^2_F.
$$
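The stacked form can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the two-layer `tanh` network with random weights, the synthetic data, and the helper `fit_heads` are all hypothetical. The features from each layer are concatenated column-wise, one least-squares problem is solved, and the solution is split back into $G_{n-1}$ and $G_n$.

```python
import numpy as np

def fit_heads(features, Y):
    """Jointly fit one linear map per feature block.

    features: list of arrays of shape (k, d_j), one per layer output.
    Returns a list of G_j with shape (d_out, d_j) minimizing
    || sum_j features[j] @ G_j.T - Y ||_F^2.
    """
    H = np.hstack(features)                    # stack layer outputs
    W, *_ = np.linalg.lstsq(H, Y, rcond=None)  # solve H @ W ~ Y
    G = W.T                                    # G = [G_1 ... G_m]
    splits = np.cumsum([h.shape[1] for h in features])[:-1]
    return np.split(G, splits, axis=1)

# Synthetic data and a hypothetical fixed two-layer network.
rng = np.random.default_rng(1)
k, d_in, d_hidden, d_out = 300, 4, 8, 2
X = rng.normal(size=(k, d_in))
Y = np.sin(X[:, :d_out]) + 0.1 * rng.normal(size=(k, d_out))

W1 = rng.normal(size=(d_in, d_hidden))
W2 = rng.normal(size=(d_hidden, d_out))
H_prev = np.tanh(X @ W1)       # output of f_{n-1} o ... o f_1
H_last = np.tanh(H_prev @ W2)  # output of f_n o ... o f_1

# Only G_n, acting on the final output.
(G_n_only,) = fit_heads([H_last], Y)
E_Gn = float(np.sum((H_last @ G_n_only.T - Y) ** 2))

# G_{n-1} and G_n solved jointly (the linear "skip connection").
G_prev, G_n = fit_heads([H_prev, H_last], Y)
E_both = float(np.sum((H_prev @ G_prev.T + H_last @ G_n.T - Y) ** 2))

print(f"E_Gn = {E_Gn:.4f}, E_both = {E_both:.4f}")
```

Since setting $G_{n-1} = 0$ recovers the single-head problem, the joint training error cannot be larger than the error with $G_n$ alone.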


# Optimal skip connections!