In [1]:
run load_libs.py

Minkowski loss has the form

\begin{align}
\mathbb{E}[L] = \iint \lvert y(\mathbf{x}) - t \rvert^q p(\mathbf{x}, t) d\mathbf{x} dt
\end{align}

where $q > 0$. The commonly used mean squared error and mean absolute error are two specific cases of Minkowski loss with $q=2$ and $q=1$, respectively.
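
As a minimal sketch of the definition, the expectation can be estimated on a finite sample by averaging $\lvert y - t \rvert^q$. The function name `minkowski_loss` and the toy arrays below are hypothetical and chosen only for illustration.

```python
import numpy as np


def minkowski_loss(y_pred, t, q):
    """Empirical Minkowski loss: the sample mean of |y - t|^q."""
    if q <= 0:
        raise ValueError("q must be positive")
    y_pred = np.asarray(y_pred, dtype=float)
    t = np.asarray(t, dtype=float)
    return float(np.mean(np.abs(y_pred - t) ** q))


# Hypothetical predictions and targets, for illustration only.
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
t = np.array([3.0, -0.5, 2.0, 7.0])

mse = minkowski_loss(y_pred, t, q=2)  # same as the mean squared error
mae = minkowski_loss(y_pred, t, q=1)  # same as the mean absolute error
print(mse, mae)
```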

## Mean squared error (MSE, Minkowski loss with $q=2$)

Definition:

\begin{align}
\mathbb{E}[L] = \iint \left[ y(\mathbf{x}) - t \right]^2 p(\mathbf{x}, t) d\mathbf{x} dt
\end{align}

To minimize it, set the derivative with respect to $y(\mathbf{x})$ to zero:

\begin{align}
\frac{\partial \mathbb{E}[L]}{\partial y(\mathbf{x})}
= \int 2 (y(\mathbf{x}) - t) p(\mathbf{x}, t) dt
&= 0 \\
\int y(\mathbf{x}) p(\mathbf{x}, t) dt
&= \int t p(\mathbf{x}, t) dt \\
y(\mathbf{x})
&= \frac{\int{ t p(\mathbf{x}, t) dt }}{p(\mathbf{x})} \\
&= \int t p(t | \mathbf{x}) dt \\
&= \mathbb{E}\left[t|\mathbf{x} \right]
\end{align}

So the conditional mean minimizes the mean squared error. In other words, when
minimizing the mean squared error, the model is trying to predict the
conditional mean of $t$ given $\mathbf{x}$, i.e. $\mathbb{E}\left[t \vert
\mathbf{x} \right]$.
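
The result above can be checked numerically with a rough sketch: for a single fixed $\mathbf{x}$, draw samples of $t$, evaluate the empirical mean squared error of many candidate constant predictions, and compare the best candidate with the sample mean. The gamma distribution, the sample size, and the candidate grid are assumptions made only for this illustration.

```python
import numpy as np

# Assumption: a skewed gamma sample stands in for p(t | x) at one fixed x.
rng = np.random.default_rng(0)
t = rng.gamma(shape=2.0, scale=1.5, size=2001)

# Candidate constant predictions y on an evenly spaced grid.
candidates = np.linspace(0.0, 10.0, 1001)
step = candidates[1] - candidates[0]

# Empirical mean squared error of each candidate against all samples.
mse = ((candidates[:, None] - t[None, :]) ** 2).mean(axis=1)
best_y_mse = candidates[np.argmin(mse)]

print(f"grid minimizer of MSE: {best_y_mse:.2f}")
print(f"sample mean:           {t.mean():.2f}")
```

The grid minimizer should agree with the sample mean up to the grid resolution.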

## Mean absolute error (MAE, Minkowski loss with $q=1$)

Definition:

\begin{align}
\mathbb{E}[L] = \iint \left| y(\mathbf{x}) - t \right| p(\mathbf{x}, t) d\mathbf{x} dt
\end{align}

To minimize it, set the derivative with respect to $y(\mathbf{x})$ to zero, using $\frac{d}{du} \lvert u \rvert = \operatorname{sign}(u)$ for $u \neq 0$:

\begin{align}
\frac{\partial \mathbb{E}[L]}{\partial y(\mathbf{x})}
= \int_{-\infty}^{y(\mathbf{x})} p(\mathbf{x}, t) dt - \int_{y(\mathbf{x})}^{\infty} p(\mathbf{x}, t) dt
&= 0 \\
\int_{-\infty}^{y(\mathbf{x})} p(t | \mathbf{x}) dt - \int_{y(\mathbf{x})}^{\infty} p(t | \mathbf{x}) dt
&= 0 \\
F_{t|\mathbf{x}}(y(\mathbf{x})) - (1 - F_{t|\mathbf{x}}(y(\mathbf{x})))
&= 0 \\
2F_{t|\mathbf{x}}(y(\mathbf{x})) - 1
&= 0 \\
F_{t|\mathbf{x}}(y(\mathbf{x}))
&= 0.5 \\
y(\mathbf{x})
&= F_{t|\mathbf{x}}^{-1}(0.5)
\end{align}

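Here $F_{t|\mathbf{x}}$ is the conditional cumulative distribution function of
$t$ given $\mathbf{x}$. So when minimizing the mean absolute error, the model is
trying to predict the conditional median of $t$ given $\mathbf{x}$.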
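
The same kind of numerical check works here, again as a sketch under assumptions: samples of $t$ at a single fixed $\mathbf{x}$ are drawn from a skewed gamma distribution (an arbitrary choice for illustration), and the candidate that minimizes the empirical mean absolute error is compared with the sample median.

```python
import numpy as np

# Assumption: a skewed gamma sample stands in for p(t | x) at one fixed x.
# An odd sample size makes the sample median a single observed value.
rng = np.random.default_rng(0)
t = rng.gamma(shape=2.0, scale=1.5, size=2001)

# Candidate constant predictions y on an evenly spaced grid.
candidates = np.linspace(0.0, 10.0, 1001)
step = candidates[1] - candidates[0]

# Empirical mean absolute error of each candidate against all samples.
mae = np.abs(candidates[:, None] - t[None, :]).mean(axis=1)
best_y_mae = candidates[np.argmin(mae)]

print(f"grid minimizer of MAE: {best_y_mae:.2f}")
print(f"sample median:         {np.median(t):.2f}")
print(f"sample mean:           {t.mean():.2f}")
```

Because the sample is skewed, the mean and the median differ, and the MAE minimizer should follow the median rather than the mean.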

## Minkowski loss with $q \rightarrow 0$

Definition:

\begin{align}
\mathbb{E}[L] = \lim_{q \rightarrow 0} \iint \lvert y(\mathbf{x}) - t \rvert^q p(\mathbf{x}, t) d\mathbf{x} dt
\end{align}

To work out what is being predicted when this loss is minimized, let's plot how
$\lvert y(\mathbf{x}) - t \rvert^q$ changes as a function of
$\Delta = y(\mathbf{x}) - t$ for several values of $q$.

In [4]:
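# np, pd and alt are assumed to be imported by load_libs.py
deltas = np.concatenate(
    [
        # sample more points around 0
        # (linspace includes both endpoints and avoids the float drift
        # of arange with a non-integer step)
        np.linspace(-1, 1, 201),
        np.linspace(-2, 2, 41),
    ]
)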

dfs = []
for q in [0.001, 0.01, 0.05, 0.1, 0.3, 1, 2, 10, 20, 100]:
    _df = pd.DataFrame(
        {
            "delta": deltas,
            "loss": np.abs(deltas) ** q,
        }
    ).assign(q=q)
    dfs.append(_df)
    
df_plot = pd.concat(dfs)

In [5]:
alt.Chart(df_plot, height=120, width=150).mark_line(clip=True).encode(
    x=alt.X("delta:Q", title='Δ = y(x) - t'),
    y=alt.Y(
        "loss:Q",
        scale=alt.Scale(domain=(0, 2)),
    ),
).facet(facet="q:Q", columns=5)

As seen, as $q \rightarrow 0$ (top-left panels), $|y(\mathbf{x}) - t|^q$
approaches 1 everywhere except in a shrinking neighborhood of $\Delta = 0$,
where the loss drops below 1. Therefore, to minimize the expected loss, we want
to maximize $p(\mathbf{x}, t)$ at $t = y(\mathbf{x})$ for every $\mathbf{x}$,
which is equivalent to taking $y(\mathbf{x})$ as the mode of
$p(t|\mathbf{x})$, i.e. the conditional mode.
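
A small numerical sketch can make this concrete. It assumes a hypothetical discrete distribution for $t$ at one fixed $\mathbf{x}$ (the values and probabilities below are invented for illustration), computes the expected loss exactly for each candidate prediction on a grid, and reports the minimizer for $q = 2$, $q = 1$, and a small $q$.

```python
import numpy as np

# Hypothetical discrete conditional distribution p(t | x) at one fixed x.
t_values = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
probs = np.array([0.40, 0.05, 0.10, 0.20, 0.25])  # mode at t = 0


def expected_loss(y, q):
    """Expected Minkowski loss of predicting y under the toy distribution."""
    return float(np.sum(probs * np.abs(y - t_values) ** q))


# Candidate predictions; the grid step of 0.25 includes every value of t.
candidates = np.linspace(-1.0, 5.0, 25)

best = {}
for q in [2, 1, 0.01]:
    losses = [expected_loss(y, q) for y in candidates]
    best[q] = float(candidates[int(np.argmin(losses))])
    print(f"q = {q}: grid minimizer y = {best[q]}")
```

For this toy distribution the mean is 1.85, the median is 2, and the mode is 0, so the minimizer should move from the grid point nearest the mean ($q = 2$) to the median ($q = 1$) to the mode ($q = 0.01$).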

Also, it's interesting to note that when $q$ becomes large (lower-right panels),
$|y(\mathbf{x}) - t|^q$ becomes close to 0 almost everywhere within $(-1, 1)$,
but then shoots up dramatically once $\Delta$ drops below -1 or rises above 1.