In [5]:
run load_libs.py

Minkowski loss has the form

\begin{align}
\mathbb{E}[L] = \iint \lvert y(\mathbf{x}) - t \rvert^q p(\mathbf{x}, t) d\mathbf{x} dt
\end{align}

where $q > 0$. The commonly used mean squared error and mean absolute error are two specific cases of Minkowski loss with $q=2$ and $q=1$, respectively.

## Mean squared error (MSE, Minkowski loss with $q=2$)

Definition:

\begin{align}
\mathbb{E}[L] = \iint \left[ y(\mathbf{x}) - t \right]^2 p(\mathbf{x}, t) d\mathbf{x} dt
\end{align}

To minimize it, set the derivative with respect to $y(\mathbf{x})$ to zero,

\begin{align}
\frac{\partial \mathbb{E}[L]}{\partial y(\mathbf{x})}
= \int 2 (y(\mathbf{x}) - t) p(\mathbf{x}, t) dt
&= 0 \\
\int y(\mathbf{x}) p(\mathbf{x}, t) dt
&= \int t p(\mathbf{x}, t) dt \\
y(\mathbf{x})
&= \frac{\int{ t p(\mathbf{x}, t) dt }}{p(\mathbf{x})} \\
&= \int t p(t | \mathbf{x}) dt \\
&= \mathbb{E}\left[t|\mathbf{x} \right]
\end{align}

So the conditional mean minimizes the mean squared error. In other words, when
minimizing the mean squared error, the model is trying to predict the
conditional mean of $t$ given $\mathbf{x}$, i.e. $\mathbb{E}\left[t \vert
\mathbf{x} \right]$.
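As a quick numerical sanity check of this result, the sketch below draws samples of $t$ for a single fixed $\mathbf{x}$ and searches a grid of constant predictions for the one with the smallest empirical squared error. The gamma distribution, the sample size, the grid, and all variable names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Skewed targets for one fixed x: gamma with shape 3 and scale 1 (illustrative choice).
t = rng.gamma(shape=3.0, scale=1.0, size=5_000)

# Candidate constant predictions y, spaced 0.01 apart.
candidates = np.linspace(0.0, 8.0, 801)

# Empirical mean squared error of every candidate against all samples.
mse = ((candidates[:, None] - t[None, :]) ** 2).mean(axis=1)

best_y_mse = candidates[np.argmin(mse)]
print("grid minimizer of MSE:", best_y_mse)
print("sample mean          :", t.mean())
```

Up to the grid resolution, the minimizer should coincide with the sample mean, even though the distribution is skewed.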

## Mean absolute error (MAE, Minkowski loss with $q=1$)

Definition:

\begin{align}
\mathbb{E}[L] = \iint \left| y(\mathbf{x}) - t \right| p(\mathbf{x}, t) d\mathbf{x} dt
\end{align}

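To minimize it, set the derivative with respect to $y(\mathbf{x})$ to zero, using $\partial \lvert y - t \rvert / \partial y = \operatorname{sign}(y - t)$ and then dividing both sides by $p(\mathbf{x})$,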

\begin{align}
\frac{\partial \mathbb{E}[L]}{\partial y(\mathbf{x})}
= \int_{-\infty}^{y(\mathbf{x})} p(\mathbf{x}, t) dt + \int_{y(\mathbf{x})}^{\infty} - p(\mathbf{x}, t) dt
&= 0 \\
F_{t|\mathbf{x}}(y(\mathbf{x})) - (1 - F_{t|\mathbf{x}}(y(\mathbf{x})))
&= 0 \\
2F_{t|\mathbf{x}}(y(\mathbf{x})) - 1
&= 0 \\
F_{t|\mathbf{x}}(y(\mathbf{x}))
&= 0.5 \\
y(\mathbf{x})
&= F_{t|\mathbf{x}}^{-1}(0.5)
\end{align}

So when minimizing the mean absolute error, the model is trying to predict the
conditional median of $t$ given $\mathbf{x}$.
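The same kind of numerical check works for the absolute error. This is only a sketch: the skewed gamma sample, the grid of candidate predictions, and the variable names are illustrative assumptions, chosen so that the mean and the median are clearly different.

```python
import numpy as np

rng = np.random.default_rng(0)
# Skewed targets for one fixed x, so that mean and median differ (illustrative choice).
t = rng.gamma(shape=3.0, scale=1.0, size=5_000)

# Candidate constant predictions y, spaced 0.01 apart.
candidates = np.linspace(0.0, 8.0, 801)

# Empirical mean absolute error of every candidate against all samples.
mae = np.abs(candidates[:, None] - t[None, :]).mean(axis=1)

best_y_mae = candidates[np.argmin(mae)]
print("grid minimizer of MAE:", best_y_mae)
print("sample median        :", np.median(t))
print("sample mean          :", t.mean())
```

The grid minimizer should sit next to the sample median rather than the sample mean.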

## Minkowski loss with $q \rightarrow 0$

Definition:

\begin{align}
\mathbb{E}[L] = \lim_{q \rightarrow 0} \iint \lvert y(\mathbf{x}) - t \rvert^q p(\mathbf{x}, t) d\mathbf{x} dt
\end{align}

In order to work out what is to be predicted when minimizing the loss, let's
plot how $\lvert y(\mathbf{x}) - t \rvert^q$ changes as a function of
$y(\mathbf{x}) - t$ for different $q$s.

In [13]:
deltas = np.concatenate(
    [
        # sample more points around 0
        np.arange(-1, 1.1, 0.01),
        np.arange(-2, 2.1, 0.1),
    ]
)

dfs = []
for q in [0.001, 0.01, 0.05, 0.1, 0.3, 1, 2, 10, 20, 100]:
    _df = pd.DataFrame(
        {
            "delta": deltas,
            "loss": np.abs(deltas) ** q,
        }
    ).assign(q=q)
    dfs.append(_df)
    
df_plot = pd.concat(dfs)

Visualize all $q$ values in one panel

In [17]:
alt.Chart(df_plot).mark_line(clip=True).encode(
    x=alt.X("delta:Q", title="Δ = y(x) - t"),
    y=alt.Y(
        "loss:Q",
        scale=alt.Scale(domain=(0, 2)),
    ),
    color=alt.Color("q:Q", scale=alt.Scale(type="log")),
)

Visualize all $q$ values in separate panels

In [18]:
alt.Chart(df_plot, height=120, width=150).mark_line(clip=True).encode(
    x=alt.X("delta:Q", title='Δ = y(x) - t'),
    y=alt.Y(
        "loss:Q",
        scale=alt.Scale(domain=(0, 2)),
    ),
).facet(facet="q:Q", columns=5)

As the plots show, when $q \rightarrow 0$ (top-left panels), $|y(\mathbf{x}) - t|^q$ approaches 1 everywhere
except in a narrow neighborhood of $\Delta = 0$, where the loss drops below 1. Therefore, to
minimize the loss, we want to maximize $p(\mathbf{x}, t)$ at
$t = y(\mathbf{x})$ for every $\mathbf{x}$, which is equivalent to taking
$y(\mathbf{x})$ as the mode of $p(t|\mathbf{x})$, i.e. the conditional mode.

It is also interesting to note that when $q$ becomes large (lower-right panels),
$|y(\mathbf{x}) - t|^q$ becomes 0 almost everywhere within $(-1, 1)$, but then
shoots up dramatically once $\Delta$ drops below $-1$ or rises above 1.
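To put numbers on both observations, here is a small sketch that evaluates $|\Delta|^q$ at a handful of points for one very small and one very large $q$. The specific $\Delta$ values and the two exponents are arbitrary illustrative choices.

```python
import numpy as np

# A few residuals: very close to 0, inside (-1, 1), and outside it.
deltas = np.array([1e-6, 0.01, 0.5, 0.9, 1.1, 2.0])

small_q = np.abs(deltas) ** 0.001  # q close to 0
large_q = np.abs(deltas) ** 100    # large q

for d, s, l in zip(deltas, small_q, large_q):
    print(f"delta={d:<8g}  q=0.001: {s:.4f}   q=100: {l:.3e}")
```

For $q = 0.001$ the loss stays close to 1 even at $\Delta = 10^{-6}$, which shows how narrow the dip around 0 is. For $q = 100$ the loss is tiny at $\Delta = 0.9$ and already very large at $\Delta = 1.1$.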

# Maximum likelihood interpretation

For a finite set of training data, the Minkowski loss can be written as

\begin{align*}
L = \frac{1}{N} \sum_{n=1}^N \lvert y(\mathbf{x}_n) - t_n \rvert^q 
\end{align*}

where $N$ is the total number of examples, and $n$ is the index.

If we assume the likelihood of an example to be of the form

\begin{align}
p(t|\mathbf{x}; \boldsymbol{\theta}) &= \frac{1}{Z} \exp \left \{ - \frac{\lvert y(\mathbf{x}) - t \rvert^q}{b} \right \}
\end{align}

where

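* $\boldsymbol{\theta}$ are the model parameters, on which $y(\mathbf{x})$ depends.
* $b > 0$ is a scale parameter.
* $Z$ is the normalization factor, a constant that does not depend on $\boldsymbol{\theta}$.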

then the likelihood of the data is

\begin{align*}
\mathcal{L}
&= \prod_{n=1}^N p(t_n|\mathbf{x}_n; \boldsymbol{\theta}) \\
&= \prod_{n=1}^N \frac{1}{Z} \exp \left \{ - \frac{\lvert y(\mathbf{x}_n) - t_n \rvert^q}{b} \right \}
\end{align*}

Take log-likelihood,

\begin{align*}
\ell 
&= \sum_{n=1}^N \log p(t_n|\mathbf{x}_n; \boldsymbol{\theta}) \\
&= \sum_{n=1}^N \left[ - \log Z - \frac{\lvert y(\mathbf{x}_n) - t_n \rvert^q}{b} \right ] \\
&= - N \log Z - \frac{1}{b} \sum_{n=1}^N \lvert y(\mathbf{x}_n) - t_n \rvert^q
\end{align*}

Dropping the additive constant $-N \log Z$ and the positive constant factor multiplying the sum,

\begin{align*}
\ell'
&= - \sum_{n=1}^N \lvert y(\mathbf{x}_n) - t_n \rvert^q
\end{align*}

which recovers the loss function $L$ scaled by $-N$.

So minimizing the Minkowski loss function is equivalent to maximizing the likelihood assuming the likelihood to be of a particular form.
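The equivalence can be checked numerically with a minimal sketch. It assumes a constant model $y(\mathbf{x}) = y$, synthetic Gaussian targets, and arbitrary values of $q$ and $b$. The closed form $Z = 2 b^{1/q} \Gamma(1 + 1/q)$ used for the normalization factor is not stated above; it follows from integrating $\exp\{-\lvert \Delta \rvert^q / b\}$ over the real line. All function and variable names are hypothetical.

```python
import math
import numpy as np

def minkowski_loss(y, t, q):
    """Empirical Minkowski loss L for a constant prediction y."""
    return np.mean(np.abs(y - t) ** q)

def neg_log_likelihood(y, t, q, b):
    """Negative log-likelihood under p(t) = exp(-|y - t|^q / b) / Z."""
    # Assumed closed form of the normalizer: Z = 2 * b^(1/q) * Gamma(1 + 1/q).
    log_z = math.log(2.0) + math.log(b) / q + math.lgamma(1.0 + 1.0 / q)
    return len(t) * log_z + np.sum(np.abs(y - t) ** q) / b

rng = np.random.default_rng(0)
t = rng.normal(loc=1.5, scale=1.0, size=500)  # synthetic targets
candidates = np.linspace(-1.0, 4.0, 501)      # candidate constant predictions
q, b = 1.5, 0.7                               # arbitrary illustrative values

losses = np.array([minkowski_loss(y, t, q) for y in candidates])
nlls = np.array([neg_log_likelihood(y, t, q, b) for y in candidates])

best_by_loss = candidates[np.argmin(losses)]
best_by_nll = candidates[np.argmin(nlls)]
print("argmin of Minkowski loss         :", best_by_loss)
print("argmin of negative log-likelihood:", best_by_nll)
```

Because the negative log-likelihood equals $N \log Z + \frac{N}{b} L$, an increasing affine function of $L$, both curves reach their minimum at the same candidate.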

Below we visualize the likelihood for different $q$ values. For simplicity we ignore all the constants (setting $b = 1$ and dropping $Z$), just to get an idea of the shape of the distribution, so the probability density is **NOT normalized**.

In [19]:
def likelihood(delta, q):
    return np.exp(- np.abs(delta) ** q)

In [41]:
deltas = np.concatenate(
    [
        # sample more points around 0
        np.arange(-1.1, 1.1, 0.01),
        np.arange(-4, 4.1, 0.1),
    ]
)

dfs = []
for q in [0.001, 0.01, 0.05, 0.1, 0.3, 1, 2, 10, 20, 100]:
    _df = pd.DataFrame(
        {
            "delta": deltas,
            "likelihood": likelihood(deltas, q),
        }
    ).assign(q=q)
    dfs.append(_df)
    
df_plot = pd.concat(dfs)

Visualize all $q$ values in one panel

In [42]:
alt.Chart(df_plot).mark_line(clip=True).encode(
    x=alt.X("delta:Q", title="Δ = y(x) - t"),
    y=alt.Y(
        "likelihood:Q",
    ),
    color=alt.Color("q:Q", scale=alt.Scale(type="log")),
)

Visualize all $q$ values in separate panels

In [44]:
alt.Chart(df_plot, height=120, width=150).mark_line(clip=True).encode(
    x=alt.X("delta:Q", title='Δ = y(x) - t'),
    y=alt.Y(
        "likelihood:Q",
    ),
).facet(facet="q:Q", columns=5)

Note,

* When $q = 1$, the corresponding distribution is the [Laplace distribution](https://en.wikipedia.org/wiki/Laplace_distribution).
* When $q = 2$, the corresponding distribution is the [Gaussian distribution](https://en.wikipedia.org/wiki/Normal_distribution).
* When $q \rightarrow 0$, the density decays very slowly as $|\Delta|$ grows.
* When $q \rightarrow \infty$, the density approaches a uniform distribution on $[-1, 1]$: it is flat for $|\Delta| < 1$ and drops to zero for $|\Delta| > 1$.
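These special cases can be verified with the normalized version of the density. The sketch below assumes the normalizer $Z = 2 b^{1/q} \Gamma(1 + 1/q)$, which is not given above but follows from integrating $\exp\{-\lvert \Delta \rvert^q / b\}$ over the real line. The function name `minkowski_density` is hypothetical.

```python
import math
import numpy as np

def minkowski_density(delta, q, b=1.0):
    """Normalized density exp(-|delta|^q / b) / Z, with Z = 2 b^(1/q) Gamma(1 + 1/q)."""
    z = 2.0 * b ** (1.0 / q) * math.gamma(1.0 + 1.0 / q)
    return np.exp(-np.abs(delta) ** q / b) / z

delta = np.linspace(-3.0, 3.0, 7)

# q = 1 with b = 1: Laplace density with scale 1.
laplace = np.exp(-np.abs(delta)) / 2.0
# q = 2 with b = 2 * sigma^2: Gaussian density with sigma = 1.
gaussian = np.exp(-delta ** 2 / 2.0) / math.sqrt(2.0 * math.pi)

matches_laplace = np.allclose(minkowski_density(delta, q=1, b=1.0), laplace)
matches_gaussian = np.allclose(minkowski_density(delta, q=2, b=2.0), gaussian)
print(matches_laplace, matches_gaussian)

# Large q: close to 1/2 inside (-1, 1) and essentially 0 outside.
large_q_density = minkowski_density(np.array([0.0, 0.5, 1.5]), q=100)
print(large_q_density)
```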

#### Comparing the Laplace and Gaussian distributions at the same variance

Laplace:

\begin{align*}
p(t|\mathbf{x}; \boldsymbol{\theta})
&= \frac{1}{2 b} \exp \left \{- \frac{|t - y(\mathbf{x})|}{b} \right \} \\
\text{mean}
&= y \\
\text{variance}
&= 2 b^2 \\
\end{align*}


Gaussian:

\begin{align*}
p(t|\mathbf{x}; \boldsymbol{\theta})
&= \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left \{- \frac{|t - y(\mathbf{x})|^2}{2 \sigma^2} \right \} \\
\text{mean}
&= y \\
\text{variance}
&= \sigma^2 \\
\end{align*}
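To match the two distributions at a given variance $v$, the formulas above give $b = \sqrt{v / 2}$ for the Laplace and $\sigma = \sqrt{v}$ for the Gaussian. The sketch below checks this numerically with a simple Riemann sum. The grid, the value $v = 3$, and the variable names are illustrative assumptions.

```python
import numpy as np

# Wide, fine grid so that the tails outside it are negligible.
t = np.linspace(-60.0, 60.0, 240_001)
dt = t[1] - t[0]
variance = 3.0

b = np.sqrt(variance / 2.0)  # Laplace scale, since variance = 2 * b^2
sigma = np.sqrt(variance)    # Gaussian standard deviation

laplace = np.exp(-np.abs(t) / b) / (2.0 * b)
gaussian = np.exp(-t ** 2 / (2.0 * sigma ** 2)) / np.sqrt(2.0 * np.pi * sigma ** 2)

# Riemann-sum approximations of the total mass and of E[t^2] (the mean is 0).
for name, pdf in [("Laplace", laplace), ("Gaussian", gaussian)]:
    mass = np.sum(pdf) * dt
    var = np.sum(t ** 2 * pdf) * dt
    print(f"{name}: mass={mass:.4f}, variance={var:.4f}, peak={pdf.max():.4f}")
```

Both densities should integrate to about 1 with variance about 3, while the Laplace density has the higher peak at $t = y$.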

In [48]:
def laplace_pdf(x, μ, b):
    return 1 / (2 * b) * np.exp(- np.abs(x - μ) / b)

In [49]:
def gaussian_pdf(x, μ, σ):
    # The exponent is -(x - μ)^2 / (2 σ^2), matching the Gaussian density above.
    return 1 / np.sqrt(2 * np.pi * σ ** 2) * np.exp(-((x - μ) ** 2) / (2 * σ ** 2))

In [50]:
ts = np.arange(-10, 10, 0.04)
mu = 0

dfs = []
for variance in [1, 3, 5]:
    dfs.append(
        pd.DataFrame({
            't': ts,
            'mu': 0,
            'variance': variance,
            'type': 'Laplace',
            'density': laplace_pdf(ts, μ=0, b=np.sqrt(variance / 2))
        })
    )
    
    dfs.append(
        pd.DataFrame({
            't': ts,
            'mu': 0,
            'variance': variance,
            'type': 'Gaussian',
            'density': gaussian_pdf(ts, μ=0, σ=np.sqrt(variance))
        })
    )
    
ndf = pd.concat(dfs).reset_index(drop=True)

In [53]:
alt.Chart(ndf).mark_line().encode(
    x=alt.X("t", title="t - y"),
    y="density",
    color="type:N",
).properties(width=250, height=200).facet(facet="variance")

#### Laplace distributions with different $b$

In [57]:
xs = np.arange(-10, 10, 0.04)
μ = 0

dfs = []
for b in [1, 3, 5, 10]:
    variance = 2 * b ** 2
    ys = laplace_pdf(xs, μ, b)
    dfs.append(
        pd.DataFrame(
            {
                "xs": xs,
                "ys": ys,
            }
        ).assign(b=b, variance=variance)
    )

ndf = pd.concat(dfs).reset_index(drop=True)

alt.Chart(ndf).mark_line().encode(
    x=alt.X("xs", title='x'),
    y=alt.Y("ys", title='Density'),
    color="variance",
)