# Complete derivation of belief update rules for trees


## I. Formal definitions

### Evidence

<img src = '../illustrations/belief-propagation-in-tree-evidence-i.png' width=100%>


* **Direct evidence** $\varepsilon_V$ for a node $V$ is an observation $V=v_*$ that determines the **local likelihood** $\lambda_V^*(v)=[v=v_*]$.
* **Indirect evidence** $\varepsilon_V$ for a node $V$ is a partial observation that determines the **local likelihood** $\lambda_V^*(v)=\Pr[\varepsilon_V|V=v]$. 


As indirect evidence $\varepsilon_V$ can be modelled by adding a new successor $V_*$ to the node $V$ with the conditional distribution $\Pr[V_*=\varepsilon_V|V=v]$
determined by the local likelihood $\lambda_V^*(v)$, we do not consider indirect evidence in the further analysis.
For the chains, we needed to analyse the effect of the indirect evidence separately as this extra node converts a chain into a tree.

### Evidence partitioning

* **Evidence** $\mathsf{evidence}$ is the combined evidence of all nodes in the tree.
* **Upstream evidence** $\color{red}{\mathsf{evidence}^+}(V)$ is the evidence of all nodes reachable through a predecessor of $V$ together with the evidence for $V$.
* **Downstream evidence** $\color{blue}{\mathsf{evidence}^-}(V)$ is the evidence of all nodes succeeding $V$ together with the evidence for $V$.


 <img src = '../illustrations/belief-propagation-in-tree-evidence-ii.png' width=100%>

In this figure, upstream and downstream evidence for the node $D$ are the following:

\begin{align*}
\color{red}{\mathsf{evidence}^+}(D)  &= \{\varepsilon_A, \varepsilon_D, \varepsilon_I\} \\
\color{blue}{\mathsf{evidence}^-}(D) &= \{\varepsilon_D, \varepsilon_G\}\enspace.
\end{align*}



## II. Derivation of iterative update rules

### Marginal posterior probabilities

 <img src = '../illustrations/belief-propagation-in-tree-marginal-posterior-i.png' width=100%>


A mechanical application of Bayes' rule yields

\begin{align*}
p_V(v)
&= \Pr[V=v|\color{red}{\mathsf{evidence}^+}(V),\color{blue}{\mathsf{evidence}^-}(V)]\\
&=\frac{\Pr[\color{blue}{\mathsf{evidence}^-}(V)|V=v,\color{red}{\mathsf{evidence}^+}(V)]
  \cdot\Pr[V=v|\color{red}{\mathsf{evidence}^+}(V)]}{\Pr[\color{blue}{\mathsf{evidence}^-}(V)|\color{red}{\mathsf{evidence}^+}(V)]}\\
&\propto \Pr[\color{blue}{\mathsf{evidence}^-}(V)|V=v,\color{red}{\mathsf{evidence}^+}(V)]
  \cdot\Pr[V=v|\color{red}{\mathsf{evidence}^+}(V)]\enspace.
\end{align*}

By the Markov property, once the state $V=v$ is known, the upstream evidence carries no further information about the nodes downstream of $V$. Thus the knowledge $V=v,\color{red}{\mathsf{evidence}^+}(V)$ is equivalent to the knowledge $V=v$ and we can simplify:

\begin{align*}
p_V(v)
&\propto \Pr[\color{blue}{\mathsf{evidence}^-}(V)|V=v]
  \cdot\Pr[V=v|\color{red}{\mathsf{evidence}^+}(V)]\\
&\propto \lambda_V(v)\cdot \pi_V(v)\enspace.
\end{align*}

As a result, if we know the likelihood $\lambda_V(\cdot)$ and the prior $\pi_V(\cdot)$ up to a constant, then we can recover the marginal posterior $p_V(\cdot)$ through normalisation.
Up to a constant in this context means that we can omit all factors that do not depend on the value $v$. 
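
The normalisation step can be sketched in a few lines of Python. This is only an illustration: the helper name `marginal_posterior` and all numbers are assumptions, and both input vectors are deliberately given only up to a constant factor.

```python
def marginal_posterior(lam, pi):
    """Combine a likelihood and a prior (each known up to a constant) into p_V."""
    unnormalised = [l * p for l, p in zip(lam, pi)]
    total = sum(unnormalised)
    if total == 0:
        raise ValueError("contradictory evidence: every product is zero")
    return [x / total for x in unnormalised]


# Illustrative values for a three-state node V (assumed, not from the text).
lam_V = [0.2, 0.4, 0.4]   # likelihood, up to a constant
pi_V = [5.0, 3.0, 2.0]    # prior, deliberately left unnormalised

p_V = marginal_posterior(lam_V, pi_V)

# Rescaling the likelihood by a factor that does not depend on v changes nothing.
p_V_scaled = marginal_posterior([10 * x for x in lam_V], pi_V)
```

Both calls return the same distribution, which is why all factors independent of $v$ can be dropped during the derivation.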




### Likelihood update for a node without evidence

<img src = '../illustrations/belief-propagation-in-tree-likelihood-i.png' width=100%>

Let $W_1,\ldots,W_k$ be the direct successor nodes of $V$. Then the downstream evidence decomposes into $k$ parts

\begin{align*}
\color{blue}{\mathsf{evidence}^-}(V)=\color{blue}{\mathsf{evidence}^-}(W_1)\cup\ldots\cup \color{blue}{\mathsf{evidence}^-}(W_k)
\end{align*}

as the node $V$ has no evidence.
Moreover, these events are independent for a fixed value $V=v$ as they occur in different tree branches. Consequently,

\begin{align*}
\lambda_V(v)
&=\Pr[\color{blue}{\mathsf{evidence}^-}(V)|V=v]\\
&=\Pr[\color{blue}{\mathsf{evidence}^-}(W_1)\wedge\ldots\wedge \color{blue}{\mathsf{evidence}^-}(W_k)|V=v]\\
&=\Pr[\color{blue}{\mathsf{evidence}^-}(W_1)|V=v]\cdots \Pr[\color{blue}{\mathsf{evidence}^-}(W_k)|V=v]\enspace.
\end{align*}

A mechanical application of the marginalisation rule to one of the terms yields

\begin{align*}
\lambda_j(v)
&=\Pr[\color{blue}{\mathsf{evidence}^-}(W_j)|V=v]\\
&=\sum_{w_j\in W_j}\Pr[\color{blue}{\mathsf{evidence}^-}(W_j)\wedge W_j=w_j|V=v]\\
&=\sum_{w_j\in W_j}\Pr[\color{blue}{\mathsf{evidence}^-}(W_j)|W_j=w_j,V=v]\cdot \Pr[W_j=w_j|V=v]\enspace.
\end{align*}

The Markov property assures that knowledge of $V=v$ is redundant when we know $W_j=w_j$. 
Consequently, we get

\begin{align*}
\lambda_j(v)
&=\sum_{w_j\in W_j}\Pr[\color{blue}{\mathsf{evidence}^-}(W_j)|W_j=w_j]\cdot \Pr[W_j=w_j|V=v] \\
&=\sum_{w_j\in W_j}\lambda_{W_j}(w_j)M_{V\to W_j}[v, w_j]\enspace.
\end{align*}

Representing $\lambda_j(\cdot)$ and $\lambda_{W_j}(\cdot)$ as column vectors and writing $\otimes$ for the element-wise product of vectors allows us to compact the equations in matrix notation:

\begin{align*}
\lambda_j&= M_{V\to W_j} \lambda_{W_j}\\
\lambda_V&=\lambda_1\otimes\ldots\otimes\lambda_k\enspace.
\end{align*}
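
As a minimal sketch of this update, the snippet below computes the messages $\lambda_j$ and their element-wise product with plain Python lists. The transition matrices, the state counts and the helper names `mat_vec` and `likelihood_update` are assumptions made for illustration.

```python
def mat_vec(M, x):
    """Matrix-vector product M x for a matrix stored as a list of rows."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def likelihood_update(child_messages):
    """Element-wise product of the messages lambda_1, ..., lambda_k."""
    lam = [1.0] * len(child_messages[0])
    for msg in child_messages:
        lam = [a * b for a, b in zip(lam, msg)]
    return lam


# Assumed toy model: V has two states, child W1 has two and child W2 has three.
M_V_W1 = [[0.9, 0.1],
          [0.2, 0.8]]             # M[v][w] = Pr[W1 = w | V = v]
M_V_W2 = [[0.5, 0.3, 0.2],
          [0.1, 0.1, 0.8]]        # M[v][w] = Pr[W2 = w | V = v]

lam_W1 = [0.0, 1.0]               # W1 is observed in its second state
lam_W2 = [1.0, 1.0, 1.0]          # no evidence at or below W2

lam_1 = mat_vec(M_V_W1, lam_W1)   # message from the W1 branch
lam_2 = mat_vec(M_V_W2, lam_W2)   # all ones, as each row of M sums to one
lam_V = likelihood_update([lam_1, lam_2])
```

A branch without evidence contributes a constant message and therefore does not change $\lambda_V$ up to scaling.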

### Likelihood update for a node with direct evidence 

<img src = '../illustrations/belief-propagation-in-tree-likelihood-ii.png' width=100%>

Let $V=v_*$ be the direct evidence associated with the node $V$ and let

\begin{align*}
\color{blue}{\mathsf{evidence}^-_*}(V)= \color{blue}{\mathsf{evidence}^-}(V)\setminus \{V=v_*\}
\end{align*}

be the remaining evidence downstream of $V$.
Then evidence decomposition yields

\begin{align*}
\lambda_V(v)
&=\Pr[\color{blue}{\mathsf{evidence}^-}(V)|V=v]\\
&=\Pr[\color{blue}{\mathsf{evidence}^-_*}(V)\wedge V=v_*|V=v]\\
&=\Pr[\color{blue}{\mathsf{evidence}^-_*}(V)|V=v_*,V=v]\cdot \Pr[V=v_*|V=v]\\
&=\Pr[\color{blue}{\mathsf{evidence}^-_*}(V)|V=v_*]\cdot [v_*=v]\enspace.
\end{align*}

Note that $\lambda_V(v)$ is nonzero only for a single value $v_*$. 
Thus by multiplying $\lambda_V(v)$ with a constant value $\lambda_V(v_*)^{-1}$, we get an indicator:

\begin{align*}
\lambda_V(v)\propto [v=v_*]\enspace.
\end{align*}

Note that  $\lambda_V(v_*)^{-1}$ depends on $v_*$ but remains constant if we consider different values of $v\in V$.
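
The rescaling argument can be checked numerically. The sketch below assumes that the messages from the child branches are already available and are nonzero at $v_*$; the function name `likelihood_with_direct_evidence` and the numbers are hypothetical.

```python
def likelihood_with_direct_evidence(child_messages, observed):
    """Unnormalised lambda_V for a node observed in the state with index `observed`."""
    n_states = len(child_messages[0])
    lam = [1.0 if v == observed else 0.0 for v in range(n_states)]
    for msg in child_messages:            # multiply in the branch messages
        lam = [a * b for a, b in zip(lam, msg)]
    return lam


# Two assumed branch messages for a two-state node observed in state 1.
lam_V = likelihood_with_direct_evidence([[0.1, 0.8], [0.5, 0.25]], observed=1)

# Dividing by the single nonzero entry lambda_V(v_*) leaves the indicator [v = v_*].
indicator = [x / lam_V[1] for x in lam_V]
```

The downstream messages only affect the scale of $\lambda_V$, so an observed node can simply use the indicator vector.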

### Likelihood update for a node without successors

<img src = '../illustrations/belief-propagation-in-tree-likelihood-iii.png' width=100%>


The rules for updating the likelihood derived so far apply only to nodes that have successors.
Hence, we must address nodes without successors explicitly. 
If the node has direct evidence $V=v_*$ then it is the entire downstream evidence $\color{blue}{\mathsf{evidence}^-}(V)$ and thus

\begin{align*}
\lambda_V(v)
&=\Pr[\color{blue}{\mathsf{evidence}^-}(V)|V=v]\\
&=\Pr[V=v_*|V=v]\\
&=[v=v_*]\enspace.
\end{align*}

If the node does not have evidence then the entire downstream evidence $\color{blue}{\mathsf{evidence}^-}(V)$ is empty and thus

\begin{align*}
\lambda_V(v)
&=\Pr[\color{blue}{\mathsf{evidence}^-}(V)|V=v]\\
&=\Pr[\mathrm{True}|V=v]\\
&=1\enspace.
\end{align*}
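
Both base cases fit into one small helper. This is a sketch under the assumption that states are indexed $0,\ldots,n-1$; the name `leaf_likelihood` is hypothetical.

```python
def leaf_likelihood(n_states, observed=None):
    """Likelihood vector of a node without successors.

    `observed` is the index of the observed state v_*, or None without evidence.
    """
    if observed is None:
        return [1.0] * n_states                      # Pr[True | V = v] = 1
    return [1.0 if v == observed else 0.0 for v in range(n_states)]


no_evidence = leaf_likelihood(3)
with_evidence = leaf_likelihood(3, observed=2)
```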


### Prior update if a node and its predecessor are without evidence

<img src = '../illustrations/belief-propagation-in-tree-prior-i.png' width=100%>

Let $U$ be the predecessor node of $V$. Then a mechanical application of the marginalisation rule yields

\begin{align*}
\pi_V(v)
&=\Pr[V=v|\color{red}{\mathsf{evidence}^+}(V)]\\
&=\sum_{u\in U}\Pr[V=v\wedge U=u|\color{red}{\mathsf{evidence}^+}(V)]\\
&=\sum_{u\in U}\Pr[V=v|U=u,\color{red}{\mathsf{evidence}^+}(V)]\cdot \Pr[U=u|\color{red}{\mathsf{evidence}^+}(V)]\enspace.
\end{align*}

As the node $V$ has no evidence, the upstream evidence must be reachable through the predecessor node $U$.
However, this evidence is not only $\color{red}{\mathsf{evidence}^+}(U)$ if $U$ has more child nodes than just $V$.
Let $W_1,\ldots,W_{k}$ denote the children of $U$ so that $W_k=V$. Then the upstream evidence of $V$ decomposes into up- and downstream evidence:

\begin{align*}
\color{red}{\mathsf{evidence}^+}(V)=\color{red}{\mathsf{evidence}^+}(U)\cup \color{blue}{\mathsf{evidence}^-}(W_1)\cup \ldots \cup\color{blue}{\mathsf{evidence}^-}(W_{k-1})\enspace.
\end{align*}

The Markov property assures that knowledge of $\color{red}{\mathsf{evidence}^+}(V)$ is redundant for predicting $V$ when we know $U=u$, as all of this evidence is separated from $V$ by the node $U$.
Consequently, we get

\begin{align*}
\pi_V(v)
&=\sum_{u\in U}\Pr[V=v|U=u,\color{red}{\mathsf{evidence}^+}(V)]\cdot \Pr[U=u|\color{red}{\mathsf{evidence}^+}(V)]\\
&=\sum_{u\in U}\Pr[V=v|U=u]\cdot \Pr[U=u|\color{red}{\mathsf{evidence}^+}(U), \color{blue}{\mathsf{evidence}^-}(W_1), \ldots, \color{blue}{\mathsf{evidence}^-}(W_{k-1})]\\
&=\sum_{u\in U}M_{U\to V}[u,v]\cdot\frac{\Pr[\color{blue}{\mathsf{evidence}^-}(W_1), \ldots, \color{blue}{\mathsf{evidence}^-}(W_{k-1})|U=u,\color{red}{\mathsf{evidence}^+}(U)]\cdot\Pr[U=u|\color{red}{\mathsf{evidence}^+}(U)]}{\Pr[\color{blue}{\mathsf{evidence}^-}(W_1), \ldots, \color{blue}{\mathsf{evidence}^-}(W_{k-1})|\color{red}{\mathsf{evidence}^+}(U)]}\\
&\propto \sum_{u\in U}M_{U\to V}[u,v]\cdot \Pr[\color{blue}{\mathsf{evidence}^-}(W_1), \ldots, \color{blue}{\mathsf{evidence}^-}(W_{k-1})|U=u,\color{red}{\mathsf{evidence}^+}(U)]\cdot\Pr[U=u|\color{red}{\mathsf{evidence}^+}(U)]
\end{align*}

where the last line follows from the fact that the denominator is a constant that does not depend on the values of $u\in U$ and $v\in V$.

The Markov property assures that knowledge of $\color{red}{\mathsf{evidence}^+}(U)$ is redundant when we know $U=u$ and thus we can express

\begin{align*}
\pi_V(v)
&\propto \sum_{u\in U}M_{U\to V}[u,v]\cdot \Pr[\color{blue}{\mathsf{evidence}^-}(W_1), \ldots, \color{blue}{\mathsf{evidence}^-}(W_{k-1})|U=u]\cdot\Pr[U=u|\color{red}{\mathsf{evidence}^+}(U)]\\
&\propto \sum_{u\in U}M_{U\to V}[u,v]\cdot \Pr[\color{blue}{\mathsf{evidence}^-}(W_1), \ldots, \color{blue}{\mathsf{evidence}^-}(W_{k-1})|U=u]\cdot \pi_U(u)\\
&\propto \sum_{u\in U}M_{U\to V}[u,v]\cdot \pi_U(u)\cdot \Pr[\color{blue}{\mathsf{evidence}^-}(W_1), \ldots, \color{blue}{\mathsf{evidence}^-}(W_{k-1})|U=u]\enspace.
\end{align*}

Let us multiply and divide the summation term by the factor $\Pr[\color{blue}{\mathsf{evidence}^-}(W_{k})|U=u]$ to simplify the derivation.
Then 

\begin{align*}
\pi_V(v)
&\propto \sum_{u\in U}\frac{M_{U\to V}[u,v]}{\Pr[\color{blue}{\mathsf{evidence}^-}(W_{k})|U=u]}\cdot \pi_U(u)\cdot \Pr[\color{blue}{\mathsf{evidence}^-}(W_1), \ldots, \color{blue}{\mathsf{evidence}^-}(W_{k-1})|U=u]\cdot\Pr[\color{blue}{\mathsf{evidence}^-}(W_{k})|U=u]\\
&\propto \sum_{u\in U}\frac{M_{U\to V}[u,v]}{\Pr[\color{blue}{\mathsf{evidence}^-}(W_{k})|U=u]}\cdot \pi_U(u)\cdot \Pr[\color{blue}{\mathsf{evidence}^-}(W_1), \ldots, \color{blue}{\mathsf{evidence}^-}(W_{k})|U=u]
\end{align*}

where the last equation follows from the fact that branches starting from $W_1,\ldots, W_k$ are independent given the value $U=u$.

As the node $U$ has no direct evidence linked to it, the last factor is the likelihood $\lambda_U(u)$ by definition and we get

\begin{align*}
\pi_V(v)
&\propto \sum_{u\in U}M_{U\to V}[u,v]\cdot \pi_U(u)\cdot\frac{\lambda_U(u)}{\Pr[\color{blue}{\mathsf{evidence}^-}(V)|U=u]}\enspace.
\end{align*}

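Recall that the likelihood $\lambda_U(u)$ splits into the product $\lambda_1(u)\cdots\lambda_k(u)$ by the likelihood update rule, where $\lambda_k(u)=\Pr[\color{blue}{\mathsf{evidence}^-}(V)|U=u]$ is the factor contributed by $V=W_k$, and thus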

\begin{align*}
\pi_V(v)
&\propto \sum_{u\in U}M_{U\to V}[u,v]\cdot \frac{\pi_U(u)\lambda_U(u)}{\lambda_{k}(u)}\\
&\propto \sum_{u\in U}M_{U\to V}[u,v]\cdot \frac{p_U(u)}{\lambda_{k}(u)}
\end{align*}

where

\begin{align*}
\lambda_k(u)
&=\sum_{v\in V}\lambda_{V}(v)M_{U\to V}[u, v]\enspace.
\end{align*}

Representing $\pi_V(\cdot)$ and  $p_U(\cdot)$ as row vectors allows us to compact the equation in matrix notation:

\begin{align*}
\pi_V\propto \frac{p_U}{\lambda_k} M_{U\to V} 
\end{align*}

where the division line represents element-wise division of vectors. 
If the predecessor has only one child node then the expression simplifies to

\begin{align*}
\pi_V\propto \pi_U M_{U\to V} 
\end{align*}

as expected (this is the prior update formula for chains).
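
The following sketch evaluates the prior update for a predecessor with two children and cross-checks it against the equivalent form without division, $\pi_V(v)\propto\sum_u M_{U\to V}[u,v]\,\pi_U(u)\,\lambda_1(u)$. All numbers and helper names are assumptions; the division form additionally assumes that every $\lambda_k(u)$ is strictly positive.

```python
def vec_mat(x, M):
    """Row vector times matrix, x M, with M stored as a list of rows."""
    return [sum(x[u] * M[u][v] for u in range(len(x))) for v in range(len(M[0]))]


def normalise(x):
    total = sum(x)
    return [xi / total for xi in x]


def prior_update(p_U, lam_k, M_U_V):
    """pi_V proportional to (p_U / lambda_k) M; assumes every lambda_k(u) > 0."""
    ratio = [p / l for p, l in zip(p_U, lam_k)]
    return normalise(vec_mat(ratio, M_U_V))


# Assumed toy model: U has children W_1 and V = W_2, all with two states.
M_U_V = [[0.7, 0.3],
         [0.4, 0.6]]          # M[u][v] = Pr[V = v | U = u]
pi_U = [0.6, 0.4]             # prior of U
lam_1 = [0.1, 0.8]            # message from the sibling branch W_1
lam_V = [0.5, 1.0]            # likelihood of V

# Message that V sends to U, and the posterior of U up to a constant.
lam_k = [sum(M_U_V[u][v] * lam_V[v] for v in range(2)) for u in range(2)]
p_U = [pi_U[u] * lam_1[u] * lam_k[u] for u in range(2)]

pi_V = prior_update(p_U, lam_k, M_U_V)

# Cross-check: the same prior computed from pi_U and the sibling message only.
pi_V_direct = normalise(vec_mat([pi_U[u] * lam_1[u] for u in range(2)], M_U_V))
```

Dividing by $\lambda_k$ removes the evidence that $V$ itself sent upwards, so the prior of $V$ depends only on evidence outside its own subtree.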


### Prior update for a node with direct evidence

<img src = '../illustrations/belief-propagation-in-tree-prior-ii.png' width=100%>

Let $U$ be the predecessor node of $V$ and let $V=v_*$ be the direct evidence associated with the node $V$.
Then the evidence decomposition yields

\begin{align*}
\pi_V(v)
&=\Pr[V=v|\color{red}{\mathsf{evidence}^+}(V)]\\
&=\Pr[V=v|V=v_*,\color{red}{\mathsf{evidence}^+_*}(V)]
\end{align*}

where $\color{red}{\mathsf{evidence}^+_*}(V)=\color{red}{\mathsf{evidence}^+}(V)\setminus \{V=v_*\}$ denotes the remaining evidence upstream of $V$.
As the evidence $V=v_*$ is the most direct information about $V$, the remaining evidence $\color{red}{\mathsf{evidence}^+_*}(V)$ is irrelevant unless it directly contradicts $V=v_*$.
In that case, the conditioning event has zero probability and the prior is not defined at all.

Thus, we can simplify and get an indicator prior:

\begin{align*}
\pi_V(v)
&=\Pr[V=v|V=v_*]\\
&=[v=v_*]\enspace.
\end{align*}





### Prior update if a node is without evidence while its predecessor is with direct evidence

<img src = '../illustrations/belief-propagation-in-tree-prior-iii.png' width=100%>


In the analysis above we obtained the formula 
\begin{align*}
\pi_V(v)
&\propto \sum_{u\in U}M_{U\to V}[u,v]\cdot \pi_U(u)\cdot \Pr[\color{blue}{\mathsf{evidence}^-}(W_1), \ldots, \color{blue}{\mathsf{evidence}^-}(W_{k-1})|U=u]
\end{align*}
that holds for any predecessor $U$, provided that $V$ is without direct evidence.
If the node $U$ has direct evidence $U=u_*$ then  $\pi_U(u)\propto[u=u_*]$ and consequently the sum reduces to a single term:

\begin{align*}
\pi_V(v)
&\propto M_{U\to V}[u_*,v]\cdot  \Pr[\color{blue}{\mathsf{evidence}^-}(W_1), \ldots, \color{blue}{\mathsf{evidence}^-}(W_{k-1})|U=u_*]\enspace.
\end{align*}

Moreover, the second factor does not depend on the value of $v$ and thus we can further simplify:

\begin{align*}
\pi_V(v)
&\propto M_{U\to V}[u_*,v]\\
&\propto \sum_{u\in U} \pi_U(u)M_{U\to V}[u,v]\enspace.
\end{align*}

The corresponding matrix algebra formulation is

\begin{align*}
\pi_V
&\propto \pi_U M_{U\to V}\\
&\propto \frac{\pi_U}{\lambda_k}M_{U\to V}
\end{align*}

which is formally the same as in the non-exceptional case because $p_U=\pi_U$ when $U$ has direct evidence, although the special-case formula $\pi_V(v)\propto M_{U\to V}[u_*,v]$ is clearer and easier to understand.
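
A quick numerical check of this equivalence, with assumed numbers and hypothetical helper names: the general formula with an indicator prior reproduces the row of $M_{U\to V}$ selected by $u_*$, provided $\lambda_k(u_*)$ is nonzero.

```python
def vec_mat(x, M):
    """Row vector times matrix, x M, with M stored as a list of rows."""
    return [sum(x[u] * M[u][v] for u in range(len(x))) for v in range(len(M[0]))]


def normalise(x):
    total = sum(x)
    return [xi / total for xi in x]


M_U_V = [[0.7, 0.3],
         [0.4, 0.6]]          # M[u][v] = Pr[V = v | U = u], assumed values
u_star = 1                    # index of the observed state of U
pi_U = [0.0, 1.0]             # indicator prior [u = u_*]
lam_k = [0.65, 0.8]           # assumed strictly positive message from V to U

# Special case: simply read off the row of the transition matrix.
special = M_U_V[u_star]

# General formula with element-wise division by lambda_k.
general = normalise(vec_mat([p / l for p, l in zip(pi_U, lam_k)], M_U_V))
```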

### Prior update for a node without a predecessor

<img src = '../illustrations/belief-propagation-in-tree-prior-iv.png' width=100%>


The rules for updating the prior derived so far apply only to nodes that have predecessors.
Hence, we must address nodes without predecessors explicitly. 
If there is direct evidence $V=v_*$ then

\begin{align*}
\pi_V(v)
&=\Pr[V=v|\color{red}{\mathsf{evidence}^+}(V)]\\
&=\Pr[V=v|V=v_*]\\
&=[v=v_*]
\end{align*}

and thus
\begin{align*}
\pi_V\propto [v=v_*]\enspace.
\end{align*}




If there is no evidence then by definition

\begin{align*}
\pi_V(v)
&=\Pr[V=v|\color{red}{\mathsf{evidence}^+}(V)]\\
&=\Pr[V=v]\\
&=M_V[v]
\end{align*}

where $M_V$ is the vector of initial probabilities. The corresponding matrix formulation is

\begin{align*}
\pi_V\propto
M_V\enspace.
\end{align*}
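
To see the update rules working together, the sketch below runs one upward and one downward pass on a small tree and compares the result with brute-force enumeration. The tree (a root $U$ with two leaf children $V$ and $W$, where $W$ is observed), the probabilities and all variable names are assumptions chosen for illustration.

```python
def normalise(x):
    total = sum(x)
    return [xi / total for xi in x]


# Assumed toy tree: root U with children V and W, two states each; W is observed.
M_U = [0.6, 0.4]                       # initial probabilities of the root
M_U_V = [[0.7, 0.3],
         [0.4, 0.6]]                   # M[u][v] = Pr[V = v | U = u]
M_U_W = [[0.9, 0.1],
         [0.2, 0.8]]                   # M[u][w] = Pr[W = w | U = u]
w_star = 1                             # observed state of W
states = range(2)

# Upward pass: likelihoods of the leaves and messages to the root.
lam_W = [1.0 if w == w_star else 0.0 for w in states]   # leaf with direct evidence
lam_V = [1.0, 1.0]                                      # leaf without evidence
msg_W = [sum(M_U_W[u][w] * lam_W[w] for w in states) for u in states]
msg_V = [sum(M_U_V[u][v] * lam_V[v] for v in states) for u in states]
lam_U = [msg_W[u] * msg_V[u] for u in states]

# Downward pass: prior of the root, then prior of V via the division rule.
pi_U = M_U
p_U = normalise([pi_U[u] * lam_U[u] for u in states])
pi_V = normalise([sum(p_U[u] / msg_V[u] * M_U_V[u][v] for u in states)
                  for v in states])
p_V = normalise([lam_V[v] * pi_V[v] for v in states])

# Brute force: sum the joint distribution over U with W fixed to w_star.
p_V_brute = normalise([sum(M_U[u] * M_U_V[u][v] * M_U_W[u][w_star] for u in states)
                       for v in states])
```

The two results agree, which is a useful sanity check when implementing the rules for larger trees.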