# Complete derivation of belief update rules for chains



## I. Formal definitions

### Evidence

<img src = '../illustrations/belief-propagation-in-chain-evidence-i.png' width=100%>



* **Direct evidence** $\varepsilon_V$ for a node $V$ is an observation $V=v_*$ that determines the **local likelihood** $\lambda_V^*(v)=[v=v_*]$.
* **Indirect evidence** $\varepsilon_V$ for a node $V$ is a partial observation that determines the **local likelihood** $\lambda_V^*(v)=\Pr[\varepsilon_V|V=v]$.


**Example:** 
* An indirect observation $v\in\{0, 1\}$ leads to 
\begin{align*}
\lambda_V^*(v)=
\begin{cases}
1, &\text{if } v=0,\\
1, &\text{if } v=1,\\
0, &\text{otherwise}.
\end{cases}
\end{align*}
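The two kinds of local likelihood can be sketched in a few lines of Python. This is only an illustration: the state space `states` and the helper names `direct_likelihood` and `set_likelihood` are assumptions made for this example and do not come from the derivation itself.

```python
def direct_likelihood(states, observed):
    """Indicator local likelihood [v = v_*] for a direct observation V = v_*."""
    return [1.0 if v == observed else 0.0 for v in states]

def set_likelihood(states, allowed):
    """Local likelihood for the partial observation 'v lies in the set allowed'."""
    return [1.0 if v in allowed else 0.0 for v in states]

states = [0, 1, 2, 3]                                   # hypothetical state space of V
lam_direct = direct_likelihood(states, observed=2)      # direct evidence V = 2
lam_partial = set_likelihood(states, allowed={0, 1})    # indirect evidence v in {0, 1}

print(lam_direct)   # [0.0, 0.0, 1.0, 0.0]
print(lam_partial)  # [1.0, 1.0, 0.0, 0.0]
```

A general indirect observation would replace the zeros and ones with the values $\Pr[\varepsilon_V|V=v]$.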

 ### Evidence partitioning
 
 * **Evidence** $\mathsf{evidence}$ is the combined evidence of all nodes in the chain. 
 * **Upstream evidence** $\color{red}{\mathsf{evidence}^+}(V)$ is the evidence of all nodes preceding $V$ together with the evidence for $V$.
 * **Downstream evidence** $\color{blue}{\mathsf{evidence}^-}(V)$ is the evidence of all nodes succeeding $V$ together with the evidence for $V$.


 <img src = '../illustrations/belief-propagation-in-chain-evidence-ii.png' width=100%>

In this figure, upstream and downstream evidence for node $D$ are the following:

\begin{align*}
\color{red}{\mathsf{evidence}^+}(D)  &= \{\varepsilon_A, \varepsilon_C, \varepsilon_D\} \\
\color{blue}{\mathsf{evidence}^-}(D) &= \{\varepsilon_D, \varepsilon_G\}\enspace.
\end{align*}

### Probabilities associated with nodes

* For a node $V$ the **prior** $\pi_V(v)=\Pr[V=v|\color{red}{\mathsf{evidence}^+}(V)]$.
* For a node $V$ the **likelihood** $\lambda_V(v)=\Pr[\color{blue}{\mathsf{evidence}^-}(V)|V=v]$.
* For a node $V$ the **marginal posterior probability** $p_V(v)=\Pr[V=v|\color{red}{\mathsf{evidence}^+}(V), \color{blue}{\mathsf{evidence}^-}(V)]$.

## II. Derivation of iterative update rules

### Marginal posterior probabilities


<img src = '../illustrations/belief-propagation-in-chain-marginal-posterior-i.png' width=100%>

<div class="alert alert-info">
    
### Quick prelude: Bayes formula with additional knowledge 



Let $a$ and $b$ be the events we want to exchange in the conditional probability and let $c$ be the additional knowledge that always stays on the conditioning side. Then the product rule for conditional probabilities implies 

\begin{align}
\Pr[a,b,c]&=\Pr[a|b,c]\Pr[b,c]\\
\Pr[a,b,c]&=\Pr[b|a,c]\Pr[a,c]
\end{align}

From which we can conclude 

\begin{align}
\Pr[a|b,c]= \frac{\Pr[b|a,c]\Pr[a,c]}{\Pr[b,c]}=
\frac{\Pr[b|a,c]\Pr[a|c]\Pr[c]}{\Pr[b|c]\Pr[c]}=\frac{\Pr[b|a,c]\Pr[a|c]}{\Pr[b|c]}\enspace.
\end{align}

The resulting formula

\begin{align}
\Pr[a|b,c]=\frac{\Pr[b|a,c]\Pr[a|c]}{\Pr[b|c]}
\end{align}

is known as the extended Bayes formula.
</div>
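As a sanity check, the extended Bayes formula can be verified numerically on a small joint distribution. The sketch below assumes three binary variables with arbitrary weights; the table `joint` and the helper `pr` are hypothetical and serve only this illustration.

```python
from itertools import product

# Hypothetical joint distribution over three binary variables (A, B, C).
# The weights are arbitrary illustration values, normalised to sum to one.
weights = [3, 1, 2, 2, 1, 4, 2, 1]
outcomes = list(product([0, 1], repeat=3))
joint = {abc: w / sum(weights) for abc, w in zip(outcomes, weights)}

def pr(a=None, b=None, c=None):
    """Probability of the event that fixes every coordinate that is not None."""
    return sum(p for (x, y, z), p in joint.items()
               if a in (None, x) and b in (None, y) and c in (None, z))

a, b, c = 1, 0, 1
lhs = pr(a, b, c) / pr(b=b, c=c)            # Pr[a | b, c]
b_given_ac = pr(a, b, c) / pr(a=a, c=c)     # Pr[b | a, c]
a_given_c = pr(a=a, c=c) / pr(c=c)          # Pr[a | c]
b_given_c = pr(b=b, c=c) / pr(c=c)          # Pr[b | c]
rhs = b_given_ac * a_given_c / b_given_c

print(abs(lhs - rhs) < 1e-12)  # True
```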


In the following, we apply the extended Bayes formula with the following substitutions
\begin{align}
a&\equiv V=v\\
b&\equiv \color{blue}{\mathsf{evidence}^-}(V)\\
c&\equiv \color{red}{\mathsf{evidence}^+}(V)
\end{align}


The resulting mechanical application of Bayes rule yields

\begin{align*}
p_V(v)
&= \Pr[V=v|\color{red}{\mathsf{evidence}^+}(V),\color{blue}{\mathsf{evidence}^-}(V)]\\
&=\frac{\Pr[\color{blue}{\mathsf{evidence}^-}(V)|V=v,\color{red}{\mathsf{evidence}^+}(V)]
  \cdot\Pr[V=v|\color{red}{\mathsf{evidence}^+}(V)]}{\Pr[\color{blue}{\mathsf{evidence}^-}(V)|\color{red}{\mathsf{evidence}^+}(V)]}\\
&\propto \Pr[\color{blue}{\mathsf{evidence}^-}(V)|V=v,\color{red}{\mathsf{evidence}^+}(V)]
  \cdot\Pr[V=v|\color{red}{\mathsf{evidence}^+}(V)]\enspace.
\end{align*}

By the Markov property, once the state $V=v$ is known, the upstream evidence tells us nothing more about what happens downstream. Hence, the knowledge $V=v,\color{red}{\mathsf{evidence}^+}(V)$ is equivalent to the knowledge $V=v$ and we can simplify:

\begin{align*}
p_V(v)
&\propto \Pr[\color{blue}{\mathsf{evidence}^-}(V)|V=v]
  \cdot\Pr[V=v|\color{red}{\mathsf{evidence}^+}(V)]\\
&\propto \lambda_V(v)\cdot \pi_V(v)\enspace.
\end{align*}

As a result, if we know the likelihood $\lambda_V(\cdot)$ and the prior $\pi_V(\cdot)$ up to a constant then we can recover the marginal posterior $p_V(\cdot)$ through normalisation. 
Up to a constant in this context means that we can omit all factors that do not depend on the value $v$.
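The normalisation step can be sketched as follows. The vectors `lam_V` and `pi_V` hold hypothetical values that are deliberately not normalised, and the function names are assumptions made for this example.

```python
def normalise(vec):
    """Rescale a non-negative vector so that its entries sum to one."""
    total = sum(vec)
    if total == 0:
        raise ValueError("all entries are zero: the evidence is contradictory")
    return [x / total for x in vec]

def marginal_posterior(lam, pi):
    """p_V(v) is proportional to lambda_V(v) * pi_V(v)."""
    return normalise([l * p for l, p in zip(lam, pi)])

lam_V = [0.2, 0.6, 0.1]   # likelihood, known only up to a constant
pi_V = [5.0, 2.5, 2.5]    # prior, known only up to a constant
p_V = marginal_posterior(lam_V, pi_V)

# Rescaling either input by a factor that does not depend on v changes nothing.
p_V_rescaled = marginal_posterior([10 * x for x in lam_V], pi_V)
```

Both calls return the same distribution, which is the practical meaning of "up to a constant".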

### Likelihood update for a node without evidence

<img src = '../illustrations/belief-propagation-in-chain-likelihood-i.png' width=100%>

Let $W$ be the successor node of $V$. Then a mechanical application of the marginalisation rule yields

\begin{align*}
\lambda_V(v)
&=\Pr[\color{blue}{\mathsf{evidence}^-}(V)|V=v]\\
&=\sum_{w\in W}\Pr[\color{blue}{\mathsf{evidence}^-}(V)\wedge W=w|V=v]\\
&=\sum_{w\in W}\Pr[\color{blue}{\mathsf{evidence}^-}(V)|W=w,V=v]\cdot \Pr[W=w|V=v] \enspace.
\end{align*}

As the node $V$ has no evidence, the downstream evidence must be in the successor nodes and thus $\color{blue}{\mathsf{evidence}^-}(V)= \color{blue}{\mathsf{evidence}^-}(W)$. 
The Markov property assures that knowledge of $V=v$ is redundant when we know $W=w$. 
Consequently, we get

\begin{align*}
\lambda_V(v)
&=\sum_{w\in W}\Pr[\color{blue}{\mathsf{evidence}^-}(W)|W=w]\cdot \Pr[W=w|V=v] \\
&=\sum_{w\in W}\lambda_W(w)M_{V\to W}[v, w] \enspace.
\end{align*}

Representing $\lambda_W(\cdot)$ as a column vector allows us to compact the equation in matrix notation:

\begin{align*}
\lambda_V\propto M_{V\to W} \lambda_W\enspace.
\end{align*}
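In code, this update is a plain matrix-vector product. The sketch below uses nested lists instead of a linear algebra library; the transition matrix `M_VW` and the vector `lam_W` contain hypothetical values.

```python
def likelihood_update(M, lam_W):
    """lambda_V[v] = sum over w of M[v][w] * lambda_W[w]."""
    return [sum(m_vw * l_w for m_vw, l_w in zip(row, lam_W)) for row in M]

# M_VW[v][w] = Pr[W = w | V = v], so every row sums to one.
M_VW = [[0.9, 0.1],
        [0.3, 0.7]]
lam_W = [0.0, 1.0]        # e.g. the successor W was observed in state 1

lam_V = likelihood_update(M_VW, lam_W)
print(lam_V)              # [0.1, 0.7]
```

The result is simply the column of $M_{V\to W}$ selected by the observed state of $W$.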

### Likelihood update for a node with direct evidence 

<img src = '../illustrations/belief-propagation-in-chain-likelihood-ii.png' width=100%>

Let $W$ be the successor node of $V$ and let $V=v_*$ be the direct evidence associated with the node $V$.
Then evidence decomposition yields

\begin{align*}
\lambda_V(v)
&=\Pr[\color{blue}{\mathsf{evidence}^-}(V)|V=v]\\
&=\Pr[\color{blue}{\mathsf{evidence}^-}(W)\wedge V=v_*|V=v]\\
&=\Pr[\color{blue}{\mathsf{evidence}^-}(W)|V=v_*,V=v]\cdot \Pr[V=v_*|V=v]\\
&=\Pr[\color{blue}{\mathsf{evidence}^-}(W)|V=v_*]\cdot [v_*=v]\enspace.
\end{align*}

Note that $\lambda_V(v)$ is nonzero only for a single value $v_*$. 
Thus by multiplying $\lambda_V(v)$ with a constant value $\lambda_V(v_*)^{-1}$, we get an indicator:

\begin{align*}
\lambda_V(v)\propto [v=v_*]\enspace.
\end{align*}

Note that  $\lambda_V(v_*)^{-1}$ depends on $v_*$ but remains constant if we consider different values of $v\in V$.
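The rescaling argument can be illustrated with a short sketch. Here `lam_1` stands for the hypothetical values of $\Pr[\color{blue}{\mathsf{evidence}^-}(W)|V=v]$, and the function name is an assumption made for this example.

```python
def direct_evidence_likelihood(lam_1, observed_index):
    """Rescale lambda_V(v) = lam_1[v_*] * [v = v_*] into the indicator [v = v_*]."""
    raw = [lam_1[v] if v == observed_index else 0.0 for v in range(len(lam_1))]
    scale = raw[observed_index]          # this is lambda_V(v_*)
    if scale == 0:
        raise ValueError("the observation is incompatible with the downstream evidence")
    return [x / scale for x in raw]

lam_1 = [0.1, 0.7]                       # hypothetical downstream term
lam_V = direct_evidence_likelihood(lam_1, observed_index=1)
print(lam_V)                             # [0.0, 1.0]
```

The division is only possible when $\lambda_V(v_*)\neq 0$, which is why the sketch raises an error otherwise.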

### Likelihood update for a node with indirect evidence 

<img src = '../illustrations/belief-propagation-in-chain-likelihood-ii.png' width=100%>

Let $W$ be the successor node of $V$ and let $\varepsilon_V$ be the indirect evidence associated with the node $V$.
Then evidence decomposition yields

\begin{align*}
\lambda_V(v)
&=\Pr[\color{blue}{\mathsf{evidence}^-}(V)|V=v]\\
&=\Pr[\color{blue}{\mathsf{evidence}^-}(W)\wedge \varepsilon_V|V=v]\\
&=\Pr[\color{blue}{\mathsf{evidence}^-}(W)|\varepsilon_V,V=v]\cdot \Pr[\varepsilon_V|V=v]\enspace.
\end{align*}

Again the direct knowledge $V=v$ subsumes the partial knowledge $\varepsilon_V$ and we get

\begin{align*}
\lambda_V(v)
&=\Pr[\color{blue}{\mathsf{evidence}^-}(W)|V=v]\cdot \lambda_V^*(v)\\
\end{align*}

where $\lambda_V^*(v)=\Pr[\varepsilon_V|V=v]$ is the local likelihood.
Now the local likelihood can be nonzero for several states, so the first factor does not reduce to a constant and we must compute it separately.
Let $\lambda_1(v)$ denote this first factor. Again, a mechanical application of the marginalisation rule yields

\begin{align*}
\lambda_1(v)
&=\Pr[\color{blue}{\mathsf{evidence}^-}(W)|V=v]\\
&=\sum_{w\in W} \Pr[\color{blue}{\mathsf{evidence}^-}(W)\wedge W=w|V=v]\\
&=\sum_{w\in W} \Pr[\color{blue}{\mathsf{evidence}^-}(W)|W=w, V=v]\cdot \Pr[W=w|V=v]\\
&=\sum_{w\in W} \Pr[\color{blue}{\mathsf{evidence}^-}(W)|W=w]\cdot M_{V\to W}[v,w]\\
&=\sum_{w\in W} \lambda_W(w)\cdot M_{V\to W}[v,w] \enspace.
\end{align*}

Thus matrix algebra allows us to compact the update rule:
\begin{align*}
\lambda_1 &= M_{V\to W}\lambda_W\\
\lambda_V &= \lambda_1\otimes \lambda_V^*
\end{align*}
where $\otimes$ represents pointwise multiplication of vector entries.
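A minimal sketch of this two-step update, assuming a node $V$ with three states and a successor $W$ with two states. All numbers are hypothetical and the function name is chosen for this example only.

```python
def likelihood_update(M, lam_W, lam_local):
    """Return lambda_V = (M @ lambda_W) pointwise-multiplied by the local likelihood."""
    lam_1 = [sum(m * l for m, l in zip(row, lam_W)) for row in M]   # M_{V->W} lambda_W
    return [a * b for a, b in zip(lam_1, lam_local)]                # pointwise product

# M_VW[v][w] = Pr[W = w | V = v] for three states of V and two states of W.
M_VW = [[0.5, 0.5],
        [0.2, 0.8],
        [1.0, 0.0]]
lam_W = [0.3, 0.6]              # likelihood of the successor
lam_local = [1.0, 1.0, 0.0]     # partial observation: V is in state 0 or 1

lam_V = likelihood_update(M_VW, lam_W, lam_local)
```

Setting `lam_local` to a vector of ones recovers the update for a node without evidence.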

### Likelihood update for a node without successors

<img src = '../illustrations/belief-propagation-in-chain-likelihood-iii.png' width=100%>


The rules above for updating the likelihood apply only to nodes that have a successor.
Hence, we must address nodes without successors explicitly. 
Without loss of generality, we can assume that such a node $V$ has evidence $\varepsilon_V$. If the evidence is missing, we can treat this as the vacuous partial observation $v\in V$, which creates the local likelihood $\lambda_V^*(v)=1$ for all $v$. If the evidence is direct, it creates the local likelihood $\lambda_V^*(v)=[v=v_*]$.  
As a result, we get

\begin{align*}
\lambda_V(v)
&=\Pr[\color{blue}{\mathsf{evidence}^-}(V)|V=v]\\
&=\Pr[\varepsilon_V|V=v]\\
&=\lambda_V^*(v)\enspace.
\end{align*}
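With the base case in place, the likelihood rules can be chained into a backward pass over the whole chain. The sketch below assumes that every node carries a local likelihood vector (all ones for missing evidence, an indicator for direct evidence); the function name and all numbers are hypothetical.

```python
def backward_pass(transitions, local):
    """Return the likelihood vector lambda for every node of a chain.

    transitions[i][v][w] = Pr[node i+1 = w | node i = v]
    local[i]             = local likelihood vector of node i
    """
    lams = [None] * len(local)
    lams[-1] = list(local[-1])                       # node without successors
    for i in range(len(local) - 2, -1, -1):          # walk towards the first node
        M, lam_W = transitions[i], lams[i + 1]
        lam_1 = [sum(m * l for m, l in zip(row, lam_W)) for row in M]
        lams[i] = [a * b for a, b in zip(lam_1, local[i])]
    return lams

M = [[0.9, 0.1],
     [0.3, 0.7]]
local = [[1.0, 1.0],     # no evidence
         [0.5, 1.0],     # indirect evidence
         [0.0, 1.0]]     # direct evidence: last node observed in state 1
lams = backward_pass([M, M], local)
```

For long chains the entries shrink quickly, so a practical implementation would probably rescale each vector after every step, which is allowed because the likelihood is only needed up to a constant.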

### Prior update  for a node without evidence 

<img src = '../illustrations/belief-propagation-in-chain-prior-i.png' width=100%>

Let $U$ be the predecessor node of $V$. Then a mechanical application of the marginalisation rule yields

\begin{align*}
\pi_V(v)
&=\Pr[V=v|\color{red}{\mathsf{evidence}^+}(V)]\\
&=\sum_{u\in U}\Pr[V=v\wedge U=u|\color{red}{\mathsf{evidence}^+}(V)]\\
&=\sum_{u\in U}\Pr[V=v|U=u,\color{red}{\mathsf{evidence}^+}(V)]\cdot \Pr[U=u|\color{red}{\mathsf{evidence}^+}(V)]\enspace.
\end{align*}

As the node $V$ has no evidence, the upstream evidence must be in the predecessor nodes and thus $\color{red}{\mathsf{evidence}^+}(V)= \color{red}{\mathsf{evidence}^+}(U)$. 
The Markov property assures that knowledge of $\color{red}{\mathsf{evidence}^+}(U)$ is redundant when we know $U=u$. 
Consequently, we get

\begin{align*}
\pi_V(v)
&=\sum_{u\in U}\Pr[V=v|U=u]\cdot \Pr[U=u|\color{red}{\mathsf{evidence}^+}(U)]\\
&=\sum_{u\in U}M_{U\to V}[u,v]\cdot \pi_U(u)\enspace.
\end{align*}

Representing $\pi_U(\cdot)$ as a row vector allows us to compact the equation in matrix notation:

\begin{align*}
\pi_V\propto \pi_U M_{U\to V}\enspace.
\end{align*}
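The prior update is the mirror image of the likelihood update: a row vector multiplied by the transition matrix from the left. A minimal sketch with hypothetical numbers:

```python
def prior_update(pi_U, M):
    """pi_V[v] = sum over u of pi_U[u] * M[u][v]."""
    n_states_V = len(M[0])
    return [sum(pi_U[u] * M[u][v] for u in range(len(pi_U)))
            for v in range(n_states_V)]

# M_UV[u][v] = Pr[V = v | U = u], so every row sums to one.
M_UV = [[0.9, 0.1],
        [0.3, 0.7]]
pi_U = [0.5, 0.5]         # hypothetical prior of the predecessor

pi_V = prior_update(pi_U, M_UV)   # approximately [0.6, 0.4]
```

If `pi_U` sums to one then so does `pi_V`, because each row of the transition matrix sums to one.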


### Prior update  for a node with direct evidence 

<img src = '../illustrations/belief-propagation-in-chain-prior-ii.png' width=100%>

Let $U$ be the predecessor node of $V$ and let $V=v_*$ be the direct evidence associated with the node $V$.
Then the evidence decomposition yields

\begin{align*}
\pi_V(v)
&=\Pr[V=v|\color{red}{\mathsf{evidence}^+}(V)]\\
&=\Pr[V=v|V=v_*,\color{red}{\mathsf{evidence}^+}(U)]\enspace.
\end{align*}

Again, the evidence $V=v_*$ is the most direct information about $V$, so the remaining evidence $\color{red}{\mathsf{evidence}^+}(U)$ is irrelevant unless it directly contradicts $V=v_*$.
In that case, the conditioning event has zero probability and the prior is not defined at all.
Otherwise, we can simplify and get an indicator prior:

\begin{align*}
\pi_V(v)
&=\Pr[V=v|V=v_*]\\
&=[v=v_*]\enspace.
\end{align*}
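The contradiction case can be made explicit in code. In the sketch below, `pi_1` stands for the hypothetical values of $\Pr[V=v|\color{red}{\mathsf{evidence}^+}(U)]$; the function name and the choice to raise an error are assumptions made for this example.

```python
def prior_with_direct_evidence(pi_1, observed_index):
    """Return the indicator prior [v = v_*], or fail if the evidence is contradictory."""
    if pi_1[observed_index] == 0:
        # The upstream evidence rules out V = v_*, so the prior is undefined.
        raise ValueError("the upstream evidence contradicts the observation")
    return [1.0 if v == observed_index else 0.0 for v in range(len(pi_1))]

pi_V = prior_with_direct_evidence([0.6, 0.4], observed_index=1)
print(pi_V)   # [0.0, 1.0]
```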

### Prior update  for a node with indirect evidence 

<img src = '../illustrations/belief-propagation-in-chain-prior-ii.png' width=100%>

Let $U$ be the predecessor node of $V$ and let $\varepsilon_V$ be the indirect evidence associated with the node $V$.
Then the evidence decomposition yields

\begin{align*}
\pi_V(v)
&=\Pr[V=v|\color{red}{\mathsf{evidence}^+}(V)]\\
&=\Pr[V=v|\varepsilon_V,\color{red}{\mathsf{evidence}^+}(U)]\enspace.
\end{align*}

Mechanical application of Bayes rule yields

\begin{align*}
\pi_V(v)
&=\Pr[V=v|\varepsilon_V,\color{red}{\mathsf{evidence}^+}(U)]\\
&=\frac{\Pr[\varepsilon_V|V=v,\color{red}{\mathsf{evidence}^+}(U)]\cdot\Pr[V=v|\color{red}{\mathsf{evidence}^+}(U)]}{\Pr[\varepsilon_V|\color{red}{\mathsf{evidence}^+}(U)]}\\
&\propto\Pr[\varepsilon_V|V=v,\color{red}{\mathsf{evidence}^+}(U)]\cdot\Pr[V=v|\color{red}{\mathsf{evidence}^+}(U)]\enspace.
\end{align*}

As $\color{red}{\mathsf{evidence}^+}(U)$ is redundant if we know $V=v$, we get

\begin{align*}
\pi_V(v)
&\propto\Pr[\varepsilon_V|V=v]\cdot\Pr[V=v|\color{red}{\mathsf{evidence}^+}(U)]\\
&\propto\lambda_V^*(v)\cdot\Pr[V=v|\color{red}{\mathsf{evidence}^+}(U)]
\end{align*}

where $\lambda_V^*(v)=\Pr[\varepsilon_V|V=v]$ is the local likelihood. 
Let $\pi_1(v)$ denote the second factor. Then mechanical application of the marginalisation rule yields

\begin{align*}
\pi_1(v)
&=\Pr[V=v|\color{red}{\mathsf{evidence}^+}(U)]\\
&=\sum_{u\in U}\Pr[V=v\wedge U=u|\color{red}{\mathsf{evidence}^+}(U)]\\
&=\sum_{u\in U}\Pr[V=v|U=u,\color{red}{\mathsf{evidence}^+}(U)]\cdot\Pr[U=u|\color{red}{\mathsf{evidence}^+}(U)]\\
&=\sum_{u\in U}\Pr[V=v|U=u]\cdot\Pr[U=u|\color{red}{\mathsf{evidence}^+}(U)]\\
&=\sum_{u\in U}M_{U\to V}[u,v]\cdot\pi_U(u)\enspace.
\end{align*}

Thus matrix algebra allows us to compact the update rule:
\begin{align*}
\pi_1 &= \pi_U M_{U\to V}\\
\pi_V &= \pi_1\otimes \lambda_V^*
\end{align*}
where $\otimes$ represents pointwise multiplication of vector entries.
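A single step of this update can be sketched as follows. The numbers are hypothetical, and the final normalisation is an extra convenience step that is permitted because the prior is only determined up to a constant.

```python
def prior_update(pi_U, M, lam_local):
    """Return the normalised prior pi_V = (pi_U @ M) pointwise-multiplied by lam_local."""
    pi_1 = [sum(pi_U[u] * M[u][v] for u in range(len(pi_U)))
            for v in range(len(M[0]))]                  # pi_U M_{U->V}
    unnormalised = [p * l for p, l in zip(pi_1, lam_local)]
    total = sum(unnormalised)
    if total == 0:
        raise ValueError("the local evidence contradicts the upstream evidence")
    return [x / total for x in unnormalised]

M_UV = [[0.9, 0.1],
        [0.3, 0.7]]
pi_U = [0.5, 0.5]
lam_local = [0.2, 0.8]      # hypothetical values of Pr[evidence at V | V = v]

pi_V = prior_update(pi_U, M_UV, lam_local)
```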

### Prior update for a node without  a predecessor

<img src = '../illustrations/belief-propagation-in-chain-prior-iii.png' width=100%>


The rules above for updating the prior apply only to nodes that have a predecessor.
Hence, we must address nodes without predecessors explicitly. 
Without loss of generality, we can assume that such a node $V$ has evidence $\varepsilon_V$. If the evidence is missing, we can treat this as the vacuous partial observation $v\in V$, which creates the local likelihood $\lambda_V^*(v)=1$ for all $v$. If the evidence is direct, it creates the local likelihood $\lambda_V^*(v)=[v=v_*]$. 
As a result, we get an expression that can be further manipulated with the Bayes rule:

\begin{align*}
\pi_V(v)
&=\Pr[V=v|\color{red}{\mathsf{evidence}^+}(V)]\\
&=\Pr[V=v|\varepsilon_V]\\
&=\frac{\Pr[\varepsilon_V|V=v]\cdot \Pr[V=v]}{\Pr[\varepsilon_V]}\\
&\propto\lambda_V^*(v)\cdot M_{V}[v]\\
\end{align*}

where $M_V$ is the vector of initial probabilities. The corresponding matrix formulation is

\begin{align*}
\pi_V\propto \lambda_V^*\otimes M_V\enspace.
\end{align*}

For a direct observation the expression simplifies to 

\begin{align*}
\pi_V(v)\propto [v=v_*]
\end{align*}

as only one entry is nonzero.
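Together with the base case, the prior rules give a forward pass over the whole chain. The sketch below assumes that every node carries a local likelihood vector (all ones for missing evidence, an indicator for direct evidence) and normalises after each step; the function names and all numbers are hypothetical.

```python
def normalise(vec):
    """Rescale a non-negative vector so that its entries sum to one."""
    total = sum(vec)
    if total == 0:
        raise ValueError("contradictory evidence: the prior is undefined")
    return [x / total for x in vec]

def forward_pass(initial, transitions, local):
    """Return the normalised prior vector pi for every node of a chain.

    initial              = vector of initial probabilities M_V of the first node
    transitions[i][u][v] = Pr[node i+1 = v | node i = u]
    local[i]             = local likelihood vector of node i
    """
    pis = [normalise([l * m for l, m in zip(local[0], initial)])]   # no predecessor
    for M, lam_local in zip(transitions, local[1:]):
        pi_U = pis[-1]
        pi_1 = [sum(pi_U[u] * M[u][v] for u in range(len(pi_U)))
                for v in range(len(M[0]))]
        pis.append(normalise([p * l for p, l in zip(pi_1, lam_local)]))
    return pis

initial = [0.5, 0.5]
M = [[0.9, 0.1],
     [0.3, 0.7]]
local = [[1.0, 0.0],     # direct evidence: first node observed in state 0
         [1.0, 1.0],     # no evidence
         [0.5, 1.0]]     # indirect evidence
pis = forward_pass(initial, [M, M], local)
```

The first entry of `pis` is the indicator of the observed state, as expected for a direct observation at the first node.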


