# Estimation by score matching

<div class="definition">
    
**Definition (Score matching objective)** Given a data distribution $p_d(x)$ and an approximating distribution $p(x; \theta)$ with parameters $\theta$, we define the score matching objective as
    
$$\begin{align}
J(\theta) = \frac{1}{2} \int p_d(x) || \psi(x; \theta) - \psi_d(x)||^2 dx,
\end{align}$$
    
where $\psi(x; \theta) = \nabla \log p(x; \theta)$ and $\psi_d(x) = \nabla \log p_d(x)$ are the score functions of the model and of the data distribution respectively, and the gradients are taken with respect to $x$.
    
</div>
<br>
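
To make the definition concrete, the sketch below evaluates $J(\theta)$ by numerical integration for a one-dimensional Gaussian model $p(x; \theta) = \mathcal{N}(x; \mu, \sigma^2)$ when the data distribution is also Gaussian. The Gaussian family, the integration grid and the helper names (`gaussian_pdf`, `gaussian_score`, `explicit_score_matching`) are assumptions made for illustration only. The score is taken to be the gradient of the log-density with respect to $x$.

```python
import numpy as np


def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def gaussian_score(x, mean, std):
    """Score of N(mean, std^2): the derivative of the log-density w.r.t. x."""
    return -(x - mean) / std ** 2


def explicit_score_matching(mean, std, data_mean=0.0, data_std=1.0, num_points=200_001):
    """Approximate J(theta) with a Riemann sum on a wide, fine grid."""
    x = np.linspace(data_mean - 12.0 * data_std, data_mean + 12.0 * data_std, num_points)
    dx = x[1] - x[0]
    score_gap = gaussian_score(x, mean, std) - gaussian_score(x, data_mean, data_std)
    integrand = gaussian_pdf(x, data_mean, data_std) * score_gap ** 2
    return 0.5 * np.sum(integrand) * dx


# The objective vanishes when the model equals the data distribution
# and is positive for the other parameter settings tried here.
for mean, std in [(0.0, 1.0), (1.0, 1.0), (0.0, 2.0)]:
    print(f"mean={mean}, std={std}: J = {explicit_score_matching(mean, std):.5f}")
```

Only the first setting matches the (assumed) standard Gaussian data distribution, so it is the only one for which the printed objective is zero.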

<div class="theorem">
    
**Theorem (Score matching $\iff$ maximum likelihood)** Suppose that the data density satisfies $p_d(x) = p(x; \theta^*)$ for some $\theta^*$, and that no other parameter value gives the same density, i.e. $p(x; \theta) \neq p_d(x)$ whenever $\theta \neq \theta^*$. Suppose also that $p(x; \theta) > 0$ for all $x$ and $\theta$. Then
    
$$\begin{align}
J(\theta) = 0 \iff \theta = \theta^*.
\end{align}$$
    
</div>
<br>

<details class="proof">
<summary>Proof: Score matching \(\iff\) maximum likelihood</summary>
    
    
**Is implied by:** We can see that $\theta = \theta^* \implies J(\theta) = 0$ by substituting $p_d(x) = p(x; \theta^*)$ into $J(\theta^*)$ 
    
$$\begin{align}
J(\theta^*) &= \frac{1}{2} \int p_d(x) || \psi(x; \theta^*) - \psi_d(x)||^2 dx \\
            &= \frac{1}{2} \int p(x; \theta^*) || \psi(x; \theta^*) - \psi(x; \theta^*)||^2 dx \\
            &= 0.
\end{align}$$
    
**Implies:** Going the other direction, we can show that $J(\theta) = 0 \implies \theta = \theta^*$ by considering
    
$$\begin{align}
J(\theta) &= \frac{1}{2} \int p(x; \theta^*) || \psi(x; \theta) - \psi(x; \theta^*)||^2 dx = 0.
\end{align}$$
    
Since the integrand is non-negative and $p(x; \theta^*) > 0$, the above can hold only if $\psi(x; \theta) = \psi(x; \theta^*)$ for every $x$. Because the scores are the gradients of the log-densities, integrating with respect to $x$ gives
    
$$\begin{align}
\log p(x; \theta) = \log p(x; \theta^*) + \text{const.} \implies p(x; \theta) \propto p(x; \theta^*),
\end{align}$$
    
and since both $p(x; \theta)$ and $p(x; \theta^*)$ are normalised probability densities, we arrive at $p(x; \theta) = p(x; \theta^*)$. By assumption, $\theta^*$ is the only parameter value whose density equals $p_d(x) = p(x; \theta^*)$, so we have that $\theta = \theta^*$.

</details>
<br>
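
As a worked illustration of the theorem, assume (for this example only) that the model and the data are both one-dimensional Gaussians, $p(x; \theta) = \mathcal{N}(x; \mu, \sigma^2)$ with $\theta = (\mu, \sigma)$, $\sigma > 0$, and $p_d(x) = \mathcal{N}(x; \mu^*, \sigma^{*2})$. Taking the scores to be gradients of the log-densities, we have $\psi(x; \theta) = -(x - \mu)/\sigma^2$, and the objective can be computed in closed form using the mean and variance of $x$ under $p_d$:

$$\begin{align}
\psi(x; \theta) - \psi_d(x) &= \left(\frac{1}{\sigma^{*2}} - \frac{1}{\sigma^2}\right) x + \frac{\mu}{\sigma^2} - \frac{\mu^*}{\sigma^{*2}}, \\
J(\theta) &= \frac{1}{2}\left[ \sigma^{*2} \left(\frac{1}{\sigma^{*2}} - \frac{1}{\sigma^2}\right)^2 + \frac{(\mu - \mu^*)^2}{\sigma^4} \right].
\end{align}$$

Both terms in the brackets are non-negative, so $J(\theta) = 0$ exactly when $\sigma = \sigma^*$ and $\mu = \mu^*$, in agreement with the theorem.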

<div class="theorem">
    
**Theorem (Equivalent form of $J$)** Suppose that the score function $\psi(x; \theta)$ is differentiable w.r.t. $x$ and satisfies some weak regularity conditions. Then the score matching objective $J$ can be written as
    
$$\begin{align}
J(\theta) = \int p_d(x) \sum^N_{i = 1}\left[ \partial_i \psi_i(x; \theta) + \frac{1}{2} \psi_i(x; \theta)^2 \right] dx + \text{const.},
\end{align}$$
    
where the $i$-subscript denotes the $i^{th}$ entry of the vector being indexed and $\partial_i$ denotes the partial derivative with respect to $x_i$. The constant term is independent of $\theta$.
    
</div>
<br>
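
The claim that the two forms differ only by a $\theta$-independent constant can be checked numerically. The sketch below assumes a one-dimensional Gaussian model and standard Gaussian data (both choices, and the function name `objectives`, are illustrative assumptions). It compares the original objective with the integral of $p_d(x)\left[\partial_x \psi(x; \theta) + \frac{1}{2}\psi(x; \theta)^2\right]$ for several parameter values.

```python
import numpy as np


def objectives(mean, std, num_points=200_001):
    """Return (original J, integral term of the equivalent form).

    Assumes a N(mean, std^2) model and N(0, 1) data, and approximates
    both integrals with a Riemann sum on a wide, fine grid.
    """
    x = np.linspace(-12.0, 12.0, num_points)
    dx = x[1] - x[0]
    p_d = np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)  # data density
    psi_d = -x                                           # data score
    psi = -(x - mean) / std ** 2                         # model score
    dpsi = -np.ones_like(x) / std ** 2                   # d(psi)/dx
    original = 0.5 * np.sum(p_d * (psi - psi_d) ** 2) * dx
    equivalent = np.sum(p_d * (dpsi + 0.5 * psi ** 2)) * dx
    return original, equivalent


# The gap between the two forms should not depend on the parameters.
for mean, std in [(0.0, 1.0), (1.0, 1.0), (-0.5, 2.0)]:
    original, equivalent = objectives(mean, std)
    print(f"mean={mean}, std={std}: difference = {original - equivalent:.5f}")
```

For this choice of data distribution the difference is approximately $0.5$ for every parameter setting, which is the value of $\frac{1}{2}\int p_d(x) \psi_d(x)^2 dx$ for a standard Gaussian.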

<details class="proof">
<summary>Proof: Equivalent form of \(J\)</summary>
    
Writing out $J$

$$\begin{align}
J(\theta) = \frac{1}{2} \int p_d(x) \left[ \psi(x; \theta)^\top\psi(x; \theta) - 2\psi(x; \theta)^\top\psi_d(x) + \psi_d(x)^\top\psi_d(x)\right] dx,
\end{align}$$
    
we see that the last term in the brackets evaluates to a constant that is independent of $\theta$. Using the fact that
    
$$\begin{align}
p_d(x) \psi_d(x) = p_d(x) \nabla \log p_d(x) = \nabla p_d(x),
\end{align}$$
    
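and applying integration by parts to each component, we obtain

$$\begin{align}
\int p_d(x) \psi(x; \theta)^\top\psi_d(x) dx &= \int \psi(x; \theta)^\top \nabla p_d(x) dx \\
&= \sum^N_{i = 1} \left( \int \big[p_d(x) \psi_i(x; \theta) \big]_{x_i = -\infty}^{x_i = \infty} dx_{\setminus i} - \int p_d(x) \partial_i \psi_i(x; \theta) dx \right) \\
&= - \int p_d(x) \sum^N_{i = 1} \partial_i \psi_i(x; \theta) dx,
\end{align}$$

where $dx_{\setminus i}$ denotes integration over all coordinates except $x_i$. The boundary terms vanish provided that $p_d(x) \psi(x; \theta) \to 0$ as $\|x\| \to \infty$, which is one of the regularity conditions.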
    
Substituting this into the expression for $J$ we arrive at
    
$$\begin{align}
J(\theta) = \int p_d(x) \sum^N_{i = 1}\left[ \partial_i \psi_i(x; \theta) + \frac{1}{2} \psi_i(x; \theta)^2 \right] dx + \text{const.}
\end{align}$$
    
</details>
<br>
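
Because the equivalent form involves $p_d$ only through an expectation, it can be approximated by an average over samples drawn from $p_d$, without access to $\psi_d$. The sketch below does this for an assumed one-dimensional Gaussian model parameterised by a mean $\mu$ and a precision $\lambda$ (inverse variance), for which $\psi(x; \theta) = -\lambda (x - \mu)$ and $\partial_x \psi(x; \theta) = -\lambda$. The model family, the synthetic data and the names `implicit_score_matching`, `mean_hat`, `precision_hat` and `precision_grid` are illustrative assumptions, not part of the derivation above.

```python
import numpy as np


def implicit_score_matching(samples, mean, precision):
    """Sample average of d(psi)/dx + 0.5 * psi^2 for a 1-D Gaussian model."""
    score = -precision * (samples - mean)  # psi(x; theta)
    score_derivative = -precision          # d(psi)/dx, constant in x
    return np.mean(score_derivative + 0.5 * score ** 2)


# Synthetic data (an assumption for this example): draws from N(2, 0.5^2).
rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=0.5, size=10_000)

# For this model the sample objective is quadratic in the mean and in the
# precision, so setting its gradient to zero gives a closed-form minimiser:
# the sample mean and the inverse of the (biased) sample variance.
mean_hat = samples.mean()
precision_hat = 1.0 / samples.var()

# Cross-check the closed form with a grid search over the precision.
precisions = np.linspace(1.0, 8.0, 701)
losses = [implicit_score_matching(samples, mean_hat, lam) for lam in precisions]
precision_grid = precisions[int(np.argmin(losses))]

print(f"mean estimate:             {mean_hat:.3f}")
print(f"precision (closed form):   {precision_hat:.3f}")
print(f"precision (grid search):   {precision_grid:.3f}")
```

With enough samples the estimates should approach the mean $2$ and precision $1/0.5^2 = 4$ used to generate the data, and the grid search should agree with the closed form up to the grid spacing.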

In [2]:
import numpy as np
import matplotlib.pyplot as plt

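from IPython.display import HTML
# set_matplotlib_formats in IPython.display is deprecated; use matplotlib_inline.
from matplotlib_inline.backend_inline import set_matplotlib_formats
set_matplotlib_formats('pdf', 'svg')  # render figures as vector graphics
# Load the custom stylesheet and inject it into the notebook.
with open('../../../_static/custom_style.css', 'r') as css_file:
    css_style = css_file.read()
HTML(f'<style>{css_style}</style>')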