# Exact Inference Algorithms



## CSCI E-83
## Stephen Elston

We have seen several approaches to representation of probabilistic graphical models. Now, we will turn our attention to **inference algorithms**. The goal of inference is to compute the **posterior distribution** of one or more variables in the model given **evidence**. Alternatively, we can say that inference is used to return results to a **query** on the model. 

In this discussion we differentiate between inference and learning. With Bayesian models, this distinction can be rather arbitrary, but fits the nature of the discussion herein. The role of inference in an intelligent agent is illustrated in the figure below. 

<img src="img/Inference.JPG" alt="Drawing" style="width:400px; height:200px"/>
<center> **Inference in an intelligent agent** </center>

In this lesson we will examine three efficient classes of algorithms for inference on graphical models:

1. **Variable elimination**
2. **Message passing, or sum-product or belief propagation algorithms**
3. **Junction tree algorithm**


**Suggested readings:** The following reading is an optional supplement to the material presented here:
- Barber, Sections 5.1, 5.2 (optional), 5.3, 6.1, 6.2, 6.3, 6.4, 6.5, 6.6, or
- Murphy, Section 20.1, 20.2, 20.3, 20.4.

## Complexity of inference for graphical models

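To understand the need for efficient inference algorithms it helps to understand the computational complexity of inference on a graphical model. If we use a **tabular** solution algorithm, which builds the full joint probability table, the cost is exponential in the number of variables: for $k$ variables with $n$ states each the table has $n^k$ entries, so the number of operations grows as $O(n^k)$. In the general case, exact inference is NP-hard.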

On the face of things, it might seem that performing inference on graphical models of any scale is hopeless. While it is true that there are no general efficient algorithms for solving the inference problem, there are many practical and widely applicable cases for which efficient inference algorithms exist. 

The key to reducing the computational complexity of graphical model inference algorithms is the use of independencies. The naive approach is to simply compute the full joint probability table of the graph variables and sum over it. This approach has combinatorial complexity. By combining conditional probabilities with evidence, the complexity of marginal inference can be significantly reduced. 

The algorithms we will explore take advantage of special structures commonly found in model graphs. Part of the trick is to rearrange the graph to create the desired structure. These algorithms, combined with conditional probabilities and evidence, make even large scale models tractable.    
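
To give a sense of scale, the short sketch below compares the number of entries in a full joint table with the number of entries needed when the model factorizes as a chain of pairwise conditional tables. The function names, the variable counts and the choice of binary states are illustrative assumptions, not taken from a specific model.

```python
def joint_table_size(num_vars, num_states):
    """Entries in the full joint table over num_vars variables."""
    return num_states ** num_vars


def chain_table_size(num_vars, num_states):
    """Entries when the joint factorizes as a chain: one prior table
    plus (num_vars - 1) pairwise conditional tables."""
    return num_states + (num_vars - 1) * num_states ** 2


# Compare the two representations for binary variables.
for k in (5, 10, 20, 30):
    print(k, joint_table_size(k, 2), chain_table_size(k, 2))
```

The joint table grows exponentially with the number of variables, while the factorized representation grows only linearly.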

## Elimination algorithms

To understand the **elimination algorithm**, let's use an example based on a **chain graph**. Chain graphs occur in a wide range of applications including protein activation models. An example is shown in the figure below.

<img src="img/Chain1.JPG" alt="Drawing" style="width:400px; height:75px"/>
<center> **Chain graph** </center>

Our goal is to compute the marginal distribution of $Z$, $P(Z)$:

$$P(Z) = \sum_V \sum_W \sum_X \sum_Y P(V,W,X,Y,Z)$$

We can decompose this distribution as follows:

$$P(Z) = \sum_V \sum_W \sum_X \sum_Y P(V)\ P(W \ |\ V)\ P(X\ |\ W)\ P(Y\ |\ X)\ P(Z\ |\ Y)$$

We can rearrange these terms as follows:

$$P(Z) =  \sum_W \sum_X \sum_Y P(X\ |\ W)\ P(Y\ |\ X)\ P(Z\ |\ Y) \sum_V P(V)\ P(W \ |\ V)$$

Now:

$$p(W) = \sum_V P(V)\ P(W \ |\ V)$$

So we can rewrite the marginal distribution as:

$$P(Z) =  \sum_W \sum_X \sum_Y P(X\ |\ W)\ P(Y\ |\ X)\ P(Z\ |\ Y) p(W)$$

We have **eliminated** $V$ from the graph as shown in the figure below. 

<img src="img/Eliminate1.JPG" alt="Drawing" style="width:400px; height:75px"/>
<center> **Eliminate V from the chain graph** </center>

Only a **local cost** has been paid in this elimination. By local cost we mean that the summation was only over the variable $V$. 

We can continue the process by eliminating $W$ using local summation:

$$P(Z) =  \sum_X \sum_Y P(Y\ |\ X)\ P(Z\ |\ Y) \sum_W p(W)\ P(X\ |\ W) \\
= \sum_X \sum_Y P(Y\ |\ X)\ P(Z\ |\ Y)\ p(X)$$

<img src="img/Eliminate2.JPG" alt="Drawing" style="width:400px; height:75px"/>
<center> **Eliminate W from the chain graph** </center>

Continuing the process $X$ is eliminated using local summation:

$$P(Z) =  \sum_Y P(Z\ |\ Y) \sum_X p(X)\ P(Y\ |\ X) \\
= \sum_Y P(Z\ |\ Y)\ p(Y)$$


<img src="img/Eliminate3.JPG" alt="Drawing" style="width:400px; height:75px"/>
<center> **Eliminate X from the chain graph** </center>

Finally, we can eliminate $Y$ using local summation to compute the marginal distribution of $Z$:

$$P(Z) =  \sum_Y p(Y)\ P(Z\ |\ Y)$$

<img src="img/Eliminate4.JPG" alt="Drawing" style="width:400px; height:75px"/>
<center> **Eliminate Y to compute the marginal distribution $P(Z)$** </center>  

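For a chain of $k$ variables with $n$ states each, the complexity of this elimination process is $O(kn^2)$, since each of the local summations involves an $n \times n$ table. This compares rather favorably with the $O(n^k)$ complexity of the tabular approach.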
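
The elimination steps above can be sketched in a few lines of plain Python. The probability tables here are hypothetical, and the same conditional table is reused for every link of the chain only to keep the example short. The brute-force function sums the full joint table and is included as a check on the elimination result.

```python
import itertools

# Hypothetical binary CPDs for the chain V -> W -> X -> Y -> Z.
p_v = [0.6, 0.4]                       # P(V)
# cpd[child][parent] = P(child | parent); each column sums to one.
cpd = [[0.7, 0.2],
       [0.3, 0.8]]
num_links = 4                          # W|V, X|W, Y|X, Z|Y


def eliminate_chain(prior, cpd, num_links):
    """Eliminate the chain one variable at a time, left to right."""
    marginal = prior
    for _ in range(num_links):
        # p(child) = sum_parent p(parent) P(child | parent): a local sum.
        marginal = [sum(marginal[par] * cpd[ch][par]
                        for par in range(len(marginal)))
                    for ch in range(len(cpd))]
    return marginal


def brute_force(prior, cpd, num_links):
    """Sum the full joint table over all variables except the last."""
    n = len(prior)
    result = [0.0] * n
    for states in itertools.product(range(n), repeat=num_links + 1):
        prob = prior[states[0]]
        for parent, child in zip(states, states[1:]):
            prob *= cpd[child][parent]
        result[states[-1]] += prob
    return result


p_z = eliminate_chain(p_v, cpd, num_links)
p_z_check = brute_force(p_v, cpd, num_links)
print(p_z)
print(p_z_check)
```

Both functions return the same marginal, but the elimination version touches one small table per step while the brute-force version enumerates every joint state.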

## Elimination on undirected chains  

We can also apply elimination to **undirected chain graphs**. An example is shown in the figure below. 

<img src="img/Undirected1.JPG" alt="Drawing" style="width:400px; height:75px"/>
<center> **Undirected chain graph** </center>

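Our goal is to compute the marginal distribution of the variable $Z$, $P(Z)$. We can decompose this distribution as follows, where the factor $\frac{1}{Z}$ denotes the normalization constant (partition function) rather than the variable $Z$:

$$P(Z) = \sum_V \sum_W \sum_X \sum_Y \frac{1}{Z} \phi(V,W)\ \phi(W,X)\ \phi(X,Y)\ \phi(Y,Z) \\
= \frac{1}{Z} \sum_W \sum_X \sum_Y  \phi(W,X)\ \phi(X,Y)\ \phi(Y,Z) \sum_V \phi(V,W) $$

The inner sum over $V$ produces a new factor on $W$ alone, and the elimination then proceeds along the chain exactly as in the directed case.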
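
As a rough illustration of elimination on an undirected chain, the sketch below uses a hypothetical pairwise potential shared by all four edges. The potentials are not probabilities, so the result is normalized at the end by the partition function. The brute-force function is only there as a check.

```python
import itertools

# Hypothetical pairwise potential shared by all four edges of the
# undirected chain V - W - X - Y - Z; entries need not sum to one.
phi = [[4.0, 1.0],
       [1.0, 3.0]]
num_edges = 4


def eliminate_undirected_chain(phi, num_edges):
    """Eliminate variables left to right, then normalize."""
    n = len(phi)
    message = [1.0] * n                # nothing lies to the left of V
    for _ in range(num_edges):
        # Sum out the left variable of the current edge.
        message = [sum(message[a] * phi[a][b] for a in range(n))
                   for b in range(n)]
    partition = sum(message)           # normalization constant
    return [m / partition for m in message], partition


def brute_force(phi, num_edges):
    """Enumerate every joint state and normalize."""
    n = len(phi)
    unnorm = [0.0] * n
    for states in itertools.product(range(n), repeat=num_edges + 1):
        weight = 1.0
        for a, b in zip(states, states[1:]):
            weight *= phi[a][b]
        unnorm[states[-1]] += weight
    total = sum(unnorm)
    return [u / total for u in unnorm]


p_z_undirected, partition = eliminate_undirected_chain(phi, num_edges)
print(p_z_undirected, partition)
```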


## A computational example

We have covered quite a bit of theory. Let's get practical and try a computational example. We will work on the student job application example. The DAG for this example is shown in the figure below.

<img src="img/LetterDAG.JPG" alt="Drawing" style="width:400px; height:200px"/>
<center> **DAG for the student score and letter distribution** </center>

Some of the inference methods in `pgmpy` rely on the `networkx` package. There are backward compatibility issues with the 2.X versions of `networkx`. If you do not have `networkx` version 1.11 installed, uncomment and execute the code in the cell below. 

In [None]:
#!pip install -I networkx==1.11

In [1]:
from pgmpy.models import BayesianModel
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

In a previous lesson, we constructed the DAG for the student example step by step. Here we will construct the DAG in one code block. The `check_model()` method is applied at the end to ensure there are no obvious errors. 

In [2]:
student_model = BayesianModel([('D', 'G'), ('I', 'G'), ('G', 'L'), ('I', 'S')])

CDP_D = TabularCPD(variable='D', variable_card=2, values=[[0.7, 0.3]])
CDP_I = TabularCPD(variable='I', variable_card=2, values=[[0.8, 0.2]])
CDP_L = TabularCPD(variable='L', variable_card=2, 
                   values=[[0.1, 0.4, 0.99],
                           [0.9, 0.6, 0.01]],
                   evidence=['G'], # Letter depends on the grade
                   evidence_card=[3])
CDP_S = TabularCPD(variable='S', variable_card=2,
                   values=[[0.95, 0.2],
                           [0.05, 0.8]],
                   evidence=['I'], # GRE score depends on intelligence
                   evidence_card=[2])
CDP_G = TabularCPD(variable='G', variable_card=3, 
                   values=[[0.3, 0.05, 0.9,  0.5],
                           [0.4, 0.25, 0.08, 0.3],
                           [0.3, 0.7,  0.02, 0.2]],
                  evidence=['I', 'D'],
                  evidence_card=[2, 2])
student_model.add_cpds(CDP_D, CDP_I, CDP_S, CDP_G, CDP_L)
student_model.check_model()

True

Now, we are in a position to run the variable elimination algorithm on the DAG. 

In [3]:
student_infer = VariableElimination(student_model)

An indicator of computational complexity for the variable elimination algorithm is **induced width**. In summary, for a given elimination order, the induced width is the number of variables in the largest clique created during elimination, minus one. A small induced width means that only small intermediate factors are needed to perform inference. 

In [4]:
student_infer.induced_width(['D', 'I', 'S', 'G', 'L'])

2

In [5]:
# Posterior distribution of the letter, L, given evidence I=0 and D=1.
qur = student_infer.query(variables=['L'], evidence={'I':0, 'D':1})
print(qur['L'])

╒═════╤══════════╕
│ L   │   phi(L) │
╞═════╪══════════╡
│ L_0 │   0.7980 │
├─────┼──────────┤
│ L_1 │   0.2020 │
╘═════╧══════════╛




In [6]:
# Posterior distribution of the letter, L, given evidence I=1 and D=1.
qur = student_infer.query(variables=['L'], evidence={'I':1, 'D':1})
print(qur['L'])

╒═════╤══════════╕
│ L   │   phi(L) │
╞═════╪══════════╡
│ L_0 │   0.3680 │
├─────┼──────────┤
│ L_1 │   0.6320 │
╘═════╧══════════╛
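
The two query results above can be checked by hand, since with $I$ and $D$ observed the only variable left to eliminate is $G$. The sketch below copies the CPD values from the model definition. It assumes the columns of the grade CPD are ordered with `I` varying slowest and `D` varying fastest, which is consistent with the printed results.

```python
# P(G | I, D); columns assumed ordered (I=0,D=0), (I=0,D=1), (I=1,D=0), (I=1,D=1).
p_g_given_id = {
    (0, 0): [0.3, 0.4, 0.3],
    (0, 1): [0.05, 0.25, 0.7],
    (1, 0): [0.9, 0.08, 0.02],
    (1, 1): [0.5, 0.3, 0.2],
}
# P(L | G): rows are L_0, L_1; columns are G_0, G_1, G_2.
p_l_given_g = [[0.1, 0.4, 0.99],
               [0.9, 0.6, 0.01]]


def letter_posterior(i, d):
    """P(L | I=i, D=d) = sum_G P(G | i, d) P(L | G)."""
    p_g = p_g_given_id[(i, d)]
    return [sum(p_l_given_g[l][g] * p_g[g] for g in range(3))
            for l in range(2)]


print(letter_posterior(0, 1))   # compare with the first query
print(letter_posterior(1, 1))   # compare with the second query
```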




## Message passing algorithms

In this section we will see how the elimination algorithm can be generalized to **tree graphs** to create the **message passing** algorithm, also known as the **belief propagation** algorithm. 

Let's start with a simple example of a tree graph. 

> **Definition:** A tree graph is an undirected acyclic graph, in which any pair of nodes is connected by exactly one path. 

An example of a tree graph is shown in the figure below. Notice there is a flow of information from the bottom to the top. 

<img src="img/EliminationTree.JPG" alt="Drawing" style="width:350px; height:400px"/>
<center> **Elimination applied to a tree for query on node 1** </center>

In general, we can compute the factor which results from eliminating variables below as follows:

$$m_{ji}(x_i) = \sum_{x_j} \Big(\ \psi(x_j)\ \psi(x_i,x_j)\ \prod_{f \in N(j) \backslash i} m_{fj}(x_j) \Big) \\
\text{where} \\
\psi(x_j) = \text{node potential} \\
\psi(x_i,x_j) = \text{edge (pairwise clique) potential}
$$

> **Note:** The notation $f \in N(j) \backslash i$ indicates all nodes $f$ in the set of neighbors of $j$, $N(j)$, except node $i$.

We can also think of this formulation as **passing a message** from node $j$ to node $i$. In other words, we can say that $m_{ji}(x_i)$ represents **propagating a "belief"** from node $j$ to node $i$. Notice that **elimination on a tree is equivalent to message passing along the branches of the tree**. 
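
The message recursion can be sketched directly in Python. The five-node tree, its edges and all potential values below are hypothetical, chosen only for illustration. The `message` function follows the formula for $m_{ji}(x_i)$, and the brute-force function is included as a check.

```python
import itertools

# Hypothetical 5-node tree with binary variables.
edges = [(1, 2), (2, 3), (3, 4), (3, 5)]
nodes = [1, 2, 3, 4, 5]
states = [0, 1]
node_pot = {n: [1.0, 1.0 + 0.5 * n] for n in nodes}         # psi(x_n)
edge_pot = {e: [[2.0, 1.0], [1.0, 2.0]] for e in edges}      # psi(x_a, x_b)

neighbors = {n: [] for n in nodes}
for a, b in edges:
    neighbors[a].append(b)
    neighbors[b].append(a)


def pairwise(a, b, xa, xb):
    """Look up the edge potential regardless of edge orientation."""
    if (a, b) in edge_pot:
        return edge_pot[(a, b)][xa][xb]
    return edge_pot[(b, a)][xb][xa]


def message(j, i):
    """m_ji(x_i): eliminate node j and everything behind it, as seen from i."""
    incoming = [message(f, j) for f in neighbors[j] if f != i]
    result = []
    for xi in states:
        total = 0.0
        for xj in states:
            term = node_pot[j][xj] * pairwise(i, j, xi, xj)
            for m in incoming:
                term *= m[xj]
            total += term
        result.append(total)
    return result


def marginal(q):
    """Normalized belief at the query (root) node q."""
    belief = list(node_pot[q])
    for j in neighbors[q]:
        m = message(j, q)
        belief = [b * m[x] for x, b in enumerate(belief)]
    z = sum(belief)
    return [b / z for b in belief]


def brute_force(q):
    """Marginal of node q by enumerating every joint state."""
    totals = [0.0, 0.0]
    for xs in itertools.product(states, repeat=len(nodes)):
        x = dict(zip(nodes, xs))
        weight = 1.0
        for n in nodes:
            weight *= node_pot[n][x[n]]
        for a, b in edges:
            weight *= edge_pot[(a, b)][x[a]][x[b]]
        totals[x[q]] += weight
    z = sum(totals)
    return [t / z for t in totals]


print(marginal(1), brute_force(1))
```

Note that calling `marginal` for every node recomputes many of the same messages, since nothing is cached in this sketch.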

We can summarize the message passing or belief propagation algorithm:

> **Belief Propagation Algorithm**    
```
INPUT: query node Q, tree graph
Set root of tree = Q
FOR each edge in the tree:
    Orient edge away from Q toward the leaves
FOR each node j in schedule (depth first, leaves before parents):
    Collect messages from the children of j
    Eliminate j by passing message m_ji to its parent i
Compute the marginal of Q from the messages it receives
```


> **Note:** Belief propagation uses the tree graph itself as the representation data structure. 

### Computing marginals with message passing

Now that we have looked at the basics of the message passing or belief propagation algorithm we will generalize the method to efficiently compute marginal distributions of the nodes of a tree graph. The naive approach would be to perform a query on each of the nodes in the graph. While this approach would work in principle, it is computationally inefficient. We will explore an algorithm that efficiently computes the marginal distributions on the graph by reusing messages. 

A key fact to notice is that a node **sends a message to a neighbor only once it has received messages from all of its other neighbors**. For example, in the figure above the messages must be passed in the following order:

1. $m_{53}(x_3)$ and $m_{43}(x_3)$,
2. $m_{32}(x_2)$, and
3. $m_{21}(x_1)$.

If we were to continue with simple message passing to compute all the marginal distributions of the variables on the graph, we would be recomputing the same messages several times. In fact, the computational complexity of this approach is $NC$, where $N$ is the number of nodes and $C$ is the complexity of a single query, which is determined by the branching of the nodes. 

Keeping in mind that the ordering requirement guides the construction of efficient algorithms, we explore a method to extend the message passing algorithm. 

Developing an efficient method for computing the marginal distributions on a tree leads us to the **two pass algorithm**. The two passes are a collection pass and a distribution pass, preceded by an evidence update and followed by computation of the marginals:
1. The **conditional probability distributions** (**CPDs**) are updated using **evidence**.  
2. The nodes **collect** messages from their neighbors, which **emit** messages to the node. Leaf nodes emit messages at the start of the collection step. 
3. Once a node has collected messages from its neighbors it **distributes** messages to its neighbors. In the distribution phase, a node may only emit a message to a neighbor once it has collected messages from all of its other neighbors. 
4. The marginal distributions are computed. 

A schematic view of this algorithm is illustrated in the figure below. 

<img src="img/SumProductTree.JPG" alt="Drawing" style="width:450px; height:400px"/>
<center> **Two pass algorithm on a tree** </center>

In the evidence step, the potentials of the evidence nodes are updated using the following relationship, where $\bar{x}_i$ is the observed value of variable $x_i$:

$$\psi^E(x_i) = \psi(x_i)\ \delta(x_i, \bar{x}_i)\\
\text{where}\\
\delta(x_i, \bar{x}_i) = 1\ \text{if}\ x_i = \bar{x}_i\\
\delta(x_i, \bar{x}_i) = 0\ \text{otherwise}$$

Messages for both the collection and distribution steps are computed using the aforementioned relationship:

$$m_{ji}(x_i) = \sum_{x_j} \Big(\ \psi(x_j)\ \psi(x_i,x_j)\ \prod_{f \in N(j) \backslash i} m_{fj}(x_j) \Big)$$

The collection phase of the algorithm is illustrated in the figure below. 

<img src="img/Collect.JPG" alt="Drawing" style="width:450px; height:400px"/>
<center> **Collect phase of algorithm on a tree** </center>

The distribute phase of the algorithm is shown in the figure below.

<img src="img/Distribute.JPG" alt="Drawing" style="width:450px; height:400px"/>
<center> **Distribute phase of the algorithm on a tree** </center>

Finally, the marginal distributions of the nodes are computed, up to normalization, using the following relationship:

$$p(x_i) \propto \psi^E(x_i)\ \prod_{j \in N(i)}m_{ji}(x_i)$$
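
The four steps of the two pass algorithm can be sketched as follows. This is a minimal illustration under stated assumptions rather than a definitive implementation: the tree, the potentials and the evidence are hypothetical, all variables are binary, and a single symmetric edge potential is shared by every edge. Each message is computed exactly once and cached in a dictionary.

```python
# Hypothetical tree with binary variables; node 1 is used as the root.
edges = [(1, 2), (2, 3), (3, 4), (3, 5)]
nodes = [1, 2, 3, 4, 5]
node_pot = {n: [1.0, 2.0] for n in nodes}      # psi(x_n)
edge_pot = [[3.0, 1.0], [1.0, 3.0]]            # shared symmetric psi(x_a, x_b)
evidence = {5: 1}                              # observe x_5 = 1

neighbors = {n: [] for n in nodes}
for a, b in edges:
    neighbors[a].append(b)
    neighbors[b].append(a)

# Step 1: evidence potentials, psi^E(x_i) = psi(x_i) * delta(x_i, observed).
pot = {n: [p if evidence.get(n, x) == x else 0.0
           for x, p in enumerate(node_pot[n])]
       for n in nodes}


def send(j, i, messages):
    """Compute m_ji from the cached messages into j from its other neighbors."""
    out = []
    for xi in range(2):
        total = 0.0
        for xj in range(2):
            term = pot[j][xj] * edge_pot[xi][xj]
            for f in neighbors[j]:
                if f != i:
                    term *= messages[(f, j)][xj]
            total += term
        out.append(total)
    messages[(j, i)] = out


def two_pass(root):
    messages = {}
    # Depth-first order from the root; parents appear before children.
    order, parent, stack = [], {root: None}, [root]
    while stack:
        n = stack.pop()
        order.append(n)
        for m in neighbors[n]:
            if m != parent[n]:
                parent[m] = n
                stack.append(m)
    # Step 2 (collect): children before parents, so leaves emit first.
    for n in reversed(order):
        if parent[n] is not None:
            send(n, parent[n], messages)
    # Step 3 (distribute): parents before children.
    for n in order:
        if parent[n] is not None:
            send(parent[n], n, messages)
    # Step 4: marginals from the node potential and all incoming messages.
    marginals = {}
    for n in nodes:
        belief = list(pot[n])
        for j in neighbors[n]:
            belief = [b * messages[(j, n)][x] for x, b in enumerate(belief)]
        z = sum(belief)
        marginals[n] = [b / z for b in belief]
    return marginals, messages


marginals, messages = two_pass(root=1)
print(marginals)
```

The cache ends up holding two messages per edge, one in each direction, which is where the saving over repeated single-node queries comes from.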


## Example of belief propagation

With the belief propagation algorithm in mind, let's try a computational example. The code in the cell below imports the `BeliefPropagation` class and applies it to the student model. 

In [7]:
from pgmpy.inference import BeliefPropagation
student_belief = BeliefPropagation(student_model)



With the belief propagation object created we can execute a query. The code in the cell below queries two variables of the DAG using evidence as shown.  

In [8]:
student_belief_query = student_belief.query(variables=['L','S'], evidence={'I':0, 'D':1})
print(student_belief_query['L'])
print(student_belief_query['S'])
student_belief.factors

╒═════╤══════════╕
│ L   │   phi(L) │
╞═════╪══════════╡
│ L_0 │   0.7980 │
├─────┼──────────┤
│ L_1 │   0.2020 │
╘═════╧══════════╛
╒═════╤══════════╕
│ S   │   phi(S) │
╞═════╪══════════╡
│ S_0 │   0.9500 │
├─────┼──────────┤
│ S_1 │   0.0500 │
╘═════╧══════════╛




defaultdict(list,
            {'D': [<DiscreteFactor representing phi(D:2) at 0x287c25c2ba8>,
              <DiscreteFactor representing phi(G:3, I:2, D:2) at 0x287c25c2da0>],
             'G': [<DiscreteFactor representing phi(G:3, I:2, D:2) at 0x287c25c2da0>,
              <DiscreteFactor representing phi(L:2, G:3) at 0x287c25c2fd0>],
             'I': [<DiscreteFactor representing phi(G:3, I:2, D:2) at 0x287c25c2da0>,
              <DiscreteFactor representing phi(I:2) at 0x287c25c2e80>,
              <DiscreteFactor representing phi(S:2, I:2) at 0x287c25c2a58>],
             'L': [<DiscreteFactor representing phi(L:2, G:3) at 0x287c25c2fd0>],
             'S': [<DiscreteFactor representing phi(S:2, I:2) at 0x287c25c2a58>]})

The resulting factors are associated with the cliques as shown in the dictionary printed above. 

We can print CPDs for these factors by accessing the first element of the list as follows:

In [9]:
print(student_belief.factors['L'][0])
print(student_belief.factors['S'][0])

╒═════╤═════╤════════════╕
│ L   │ G   │   phi(L,G) │
╞═════╪═════╪════════════╡
│ L_0 │ G_0 │     0.1000 │
├─────┼─────┼────────────┤
│ L_0 │ G_1 │     0.4000 │
├─────┼─────┼────────────┤
│ L_0 │ G_2 │     0.9900 │
├─────┼─────┼────────────┤
│ L_1 │ G_0 │     0.9000 │
├─────┼─────┼────────────┤
│ L_1 │ G_1 │     0.6000 │
├─────┼─────┼────────────┤
│ L_1 │ G_2 │     0.0100 │
╘═════╧═════╧════════════╛
╒═════╤═════╤════════════╕
│ S   │ I   │   phi(S,I) │
╞═════╪═════╪════════════╡
│ S_0 │ I_0 │     0.9500 │
├─────┼─────┼────────────┤
│ S_0 │ I_1 │     0.2000 │
├─────┼─────┼────────────┤
│ S_1 │ I_0 │     0.0500 │
├─────┼─────┼────────────┤
│ S_1 │ I_1 │     0.8000 │
╘═════╧═════╧════════════╛


Finally, we can display the clique beliefs for several of the cliques of the Markov network using the code shown below. 

In [10]:
for key in student_belief.clique_beliefs.keys():
    print(student_belief.clique_beliefs[key])

╒═════╤═════╤═════╤══════════════╕
│ G   │ D   │ I   │   phi(G,D,I) │
╞═════╪═════╪═════╪══════════════╡
│ G_0 │ D_0 │ I_0 │       0.1680 │
├─────┼─────┼─────┼──────────────┤
│ G_0 │ D_0 │ I_1 │       0.1260 │
├─────┼─────┼─────┼──────────────┤
│ G_0 │ D_1 │ I_0 │       0.0120 │
├─────┼─────┼─────┼──────────────┤
│ G_0 │ D_1 │ I_1 │       0.0300 │
├─────┼─────┼─────┼──────────────┤
│ G_1 │ D_0 │ I_0 │       0.2240 │
├─────┼─────┼─────┼──────────────┤
│ G_1 │ D_0 │ I_1 │       0.0112 │
├─────┼─────┼─────┼──────────────┤
│ G_1 │ D_1 │ I_0 │       0.0600 │
├─────┼─────┼─────┼──────────────┤
│ G_1 │ D_1 │ I_1 │       0.0180 │
├─────┼─────┼─────┼──────────────┤
│ G_2 │ D_0 │ I_0 │       0.1680 │
├─────┼─────┼─────┼──────────────┤
│ G_2 │ D_0 │ I_1 │       0.0028 │
├─────┼─────┼─────┼──────────────┤
│ G_2 │ D_1 │ I_0 │       0.1680 │
├─────┼─────┼─────┼──────────────┤
│ G_2 │ D_1 │ I_1 │       0.0120 │
╘═════╧═════╧═════╧══════════════╛
╒═════╤═════╤════════════╕
│ G   │ L   │   phi(G,L) │
╞

## Factor graphs and sum product algorithm

### Factor graphs

Several efficient inference methods make use of **factor graphs**. A factor graph can be undirected or directed. Factor graphs can have advantages over both DAGs and Markov networks in terms of the factorizations which can be represented.   

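**Definition:** A factor graph represents a function $f(x_1, x_2, \ldots, x_n)$ that is a product of factors $\phi_i(\mathcal{X_i})$, where each $\mathcal{X_i}$ is a subset of the variables. The graph has a factor node for each factor $\phi_i$ and a variable node for each variable $x_j$, with an undirected link between factor $\phi_i$ and each variable $x_j \in \mathcal{X_i}$. The function is then represented as: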

$$f(x_1, x_2, \ldots, x_n) = \prod_{i=1}^n \phi_i(\mathcal{X_i})$$

In other words, a factor graph is a bipartite graph in which each factor $\phi_i(\mathcal{X_i})$ is linked to exactly the variables it depends on. 

We can use the above formulation to represent a distribution as follows:

$$p(x_1, x_2, \ldots, x_n) = \frac{1}{Z} \prod_{i=1}^n \phi_i(\mathcal{X_i}) \\
\text{where}\\
Z = \sum_{\mathcal{X}} \prod_{i=1}^n \phi_i(\mathcal{X_i})$$
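
A factor graph maps naturally onto a simple data structure. The sketch below is one possible representation, assuming hypothetical factors over three binary variables: each factor is stored as a scope plus a table of potential values, and the normalization constant $Z$ is computed by summing the product of factors over all joint states.

```python
import itertools

# A hypothetical factor graph over three binary variables.
# Each factor is (scope, table), where table maps a tuple of states,
# ordered as in scope, to a non-negative potential value.
variables = ['x1', 'x2', 'x3']
factors = [
    (('x1', 'x2'), {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}),
    (('x2', 'x3'), {(0, 0): 1.0, (0, 1): 3.0, (1, 0): 3.0, (1, 1): 1.0}),
    (('x3',), {(0,): 1.0, (1,): 2.0}),
]


def unnormalized(assignment):
    """Product of all factor potentials for one full assignment (a dict)."""
    value = 1.0
    for scope, table in factors:
        value *= table[tuple(assignment[v] for v in scope)]
    return value


assignments = [dict(zip(variables, states))
               for states in itertools.product([0, 1], repeat=len(variables))]
Z = sum(unnormalized(a) for a in assignments)


def prob(assignment):
    """Normalized probability of one full assignment."""
    return unnormalized(assignment) / Z


# Bipartite structure: each factor node links to the variables in its scope.
links = [(i, v) for i, (scope, _) in enumerate(factors) for v in scope]
print(Z)
print(links)
```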

Let's look at an example of an undirected factor graph. The Markov network and the corresponding factor graphs are shown in the figure below. 

<img src="img/MarkovFactor.JPG" alt="Drawing" style="width:400px; height:150px"/>
<center> **Markov Network and Factor Graphs** </center>

The Markov network shown on the left has a single clique. This leads to two possible factorizations. The factor graph in the middle can represent the distribution with a single clique potential $\psi(X,Y,Z)$.  The factor graph on the right represents another factorization of the distribution into multiple clique potentials:

$$p(X,Y,Z) = \psi(X,Y) \psi(X,Z) \psi(Z,Y)$$

These two factor graphs represent different factorizations of the distribution, and therefore place different constraints on it. But, they have the same representation on the Markov network. 

Let's look at another example. The figure below shows a simple directed network (DAG) and the corresponding factor graph. 

<img src="img/DAGFactor.JPG" alt="Drawing" style="width:250px; height:200px"/>
<center> **DAG and Corresponding Factor Graph** </center>

The factors required to represent a DAG are considerably different from those required for an undirected graph. The leaf nodes of the graph have terminal factors. Such factors are required to represent the distribution of the original DAG. 

### Sum-product algorithm

The generalization of the elimination method is known as the **sum-product algorithm**. The sum-product algorithm operates by passing messages on a factor graph. 

The sum-product algorithm is built on two message constructs, **variable to factor messages** and **factor to variable messages**. Let's start with the variable to factor message which is illustrated in the figure below. 

<img src="img/VarToFactor.JPG" alt="Drawing" style="width:350px; height:200px"/>
<center> **Variable to factor message example** </center>

In this figure the message passed from the variable $x$ to the factor $f$ is the product of the messages the variable receives from other factors $\{ f_1, f_2, f_3 \}$ and can be expressed as:

$$u_{x \rightarrow f}(x) = \prod_{h \in na(x) \backslash f} u_{h \rightarrow x}(x)$$  

In words, the message from variable $x$ is the product of the messages from all neighboring factors, $na(x)$, except the factor to which the message is being passed, denoted $\backslash f$. 

The factor to variable message is illustrated in the figure below:

<img src="img/FactorToVar.JPG" alt="Drawing" style="width:350px; height:200px"/>
<center> **Factor to variable message example** </center>

In this case, the potential of the factor, $f$, is passed to the variable $x$. The message is the summation over all states, $\chi_f$, of the variables of the factor except $x$, of the factor potential times the product of all messages the factor receives from its neighbors, $na(f)$, except $x$, denoted $\backslash x$. This message can be expressed as:

$$u_{f \rightarrow x}(x) = \sum_{\chi_f \backslash x} \psi_f(\chi_f) \prod_{y \in na(f) \backslash x} u_{y \rightarrow f}(y)$$


Let's make this a bit more concrete. Ultimately, we want to compute the **marginal distribution** of variables in the joint distribution over the states $\chi$. This problem can be formulated as:

$$p(\chi) = \frac{1}{Z} \prod_f \psi_f(\chi_f)\\
\text{where}\\
Z = \sum_\chi\prod_f \psi_f(\chi_f)$$


The sum-product algorithm can be summarized as follows:

> **Sum-Product Algorithm**
```
Input: query_node = x1, tree
Sort tree with x1 last.
Place potentials on the active list
FOR each node:
    Eliminate ith node by taking the sum-product over all neighbor potentials.
    Place the resulting factor on the active list
```
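
The pseudocode above can be sketched with factors stored as a scope and a table. This is a minimal illustration, not a definitive implementation: the function names, the factor representation and the example factors are all assumptions made here. The elimination order lists every variable except the query, which is therefore the only variable left at the end.

```python
import itertools


def multiply(f, g, card):
    """Pointwise product of two factors given as (scope, table)."""
    scope = tuple(dict.fromkeys(f[0] + g[0]))      # ordered union of scopes
    table = {}
    for states in itertools.product(*(range(card[v]) for v in scope)):
        a = dict(zip(scope, states))
        table[states] = (f[1][tuple(a[v] for v in f[0])] *
                         g[1][tuple(a[v] for v in g[0])])
    return scope, table


def sum_out(f, var):
    """Marginalize var out of factor f."""
    keep = tuple(v for v in f[0] if v != var)
    idx = f[0].index(var)
    table = {}
    for states, value in f[1].items():
        key = states[:idx] + states[idx + 1:]
        table[key] = table.get(key, 0.0) + value
    return keep, table


def sum_product_eliminate(factors, order, card):
    """Eliminate the variables in order; the query variable is not in order."""
    active = list(factors)                         # the active list
    for var in order:
        touching = [f for f in active if var in f[0]]
        if not touching:
            continue
        active = [f for f in active if var not in f[0]]
        product = touching[0]
        for f in touching[1:]:
            product = multiply(product, f, card)
        active.append(sum_out(product, var))       # new factor joins the list
    result = active[0]
    for f in active[1:]:
        result = multiply(result, f, card)
    z = sum(result[1].values())
    return result[0], {k: v / z for k, v in result[1].items()}


# Hypothetical factors on a small chain x1 - x2 - x3, all binary.
card = {'x1': 2, 'x2': 2, 'x3': 2}
factors = [
    (('x1', 'x2'), {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}),
    (('x2', 'x3'), {(0, 0): 1.0, (0, 1): 3.0, (1, 0): 3.0, (1, 1): 1.0}),
    (('x3',), {(0,): 1.0, (1,): 2.0}),
]
scope, p_x1 = sum_product_eliminate(factors, order=['x3', 'x2'], card=card)
print(scope, p_x1)
```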

### Max-product algorithm

In many applications such as decision problems and control problems, we really just want to know the **maximum a posteriori** or **MAP** for the marginal distribution of interest. A simple modification of the sum-product algorithm, known as the **max-product algorithm**, gives us a relatively efficient method to compute the MAP. This method is also sometimes known as the **belief revision algorithm**.

The max-product algorithm is an example of **dynamic programming**. 

The variable to factor message for the max-product algorithm is identical to the sum-product algorithm.

$$u_{x \rightarrow f}(x) = \prod_{h \in na(x) \backslash f} u_{h \rightarrow x}(x)$$  

For the factor to variable message, the max-product algorithm only requires that we replace the summation with a maximum:

$$u_{f \rightarrow x}(x) = \max_{\chi_f \backslash x} \psi_f(\chi_f) \prod_{y \in na(f) \backslash x} u_{y \rightarrow f}(y)$$

The most probable state of the variable $x$ then becomes just:

$$x^* = \arg\max_x \prod_{f \in na(x)} u_{f \rightarrow x}(x)$$

You can find more details of the max-product algorithm in the suggested readings or Sections 13.1, 13.2 and 13.3 of Koller and Friedman.  
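
As a rough sketch of the idea, the code below runs max-product along a short chain and recovers the most probable joint state by keeping back-pointers. The chain, its potentials and the function names are hypothetical. The brute-force function enumerates every joint state and is included only as a check.

```python
import itertools

# Hypothetical chain x0 - x1 - x2 - x3 with binary states.
node_pot = [[1.0, 2.0], [3.0, 1.0], [1.0, 1.0], [2.0, 5.0]]
edge_pot = [[4.0, 1.0], [1.0, 4.0]]          # shared psi(x_t, x_{t+1})


def max_product_chain(node_pot, edge_pot):
    n_states = len(edge_pot)
    message = [1.0] * n_states               # max-message arriving at x0
    backpointers = []
    for t in range(len(node_pot) - 1):
        new_message, pointers = [], []
        for nxt in range(n_states):
            # The sum of sum-product is replaced by a max over the
            # state of the variable being eliminated.
            scores = [message[cur] * node_pot[t][cur] * edge_pot[cur][nxt]
                      for cur in range(n_states)]
            best = max(range(n_states), key=scores.__getitem__)
            new_message.append(scores[best])
            pointers.append(best)
        message = new_message
        backpointers.append(pointers)
    # Decode: pick the best final state, then follow the back-pointers.
    final = [message[s] * node_pot[-1][s] for s in range(n_states)]
    state = max(range(n_states), key=final.__getitem__)
    best_score = final[state]
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return list(reversed(path)), best_score


def brute_force_map(node_pot, edge_pot):
    best, best_score = None, -1.0
    for xs in itertools.product(range(len(edge_pot)), repeat=len(node_pot)):
        score = 1.0
        for t, x in enumerate(xs):
            score *= node_pot[t][x]
        for a, b in zip(xs, xs[1:]):
            score *= edge_pot[a][b]
        if score > best_score:
            best, best_score = list(xs), score
    return best, best_score


map_states, map_score = max_product_chain(node_pot, edge_pot)
print(map_states, map_score)
```

The score returned is unnormalized. Dividing by the partition function would give the probability of the most probable state, but the normalization is not needed to find the state itself.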

There are other variations on the sum-product algorithm which can be quite useful in applications. For example, we may want to know the N most probable states. Or, we may want to find the shortest route or path through the tree of random variables. 

### Limitations of tree algorithms

As has already been discussed, the sum-product algorithm is for trees. What happens when the graph is not a tree? Let's look at an example. In the figure below the factor graph on the left has a 4 cycle. We can say that this is a **loopy graph**.  

<img src="img/Loopy.JPG" alt="Drawing" style="width:300px; height:150px"/>
<center> **Four cycle graph with one variable eliminated** </center>

When the variable $W$ is eliminated, a new chord must be added between $Z$ and $Y$. No further variable can be eliminated by the tree algorithms, since multiple factors are now involved and these algorithms operate on one factor at a time.     

## Introduction to the junction tree algorithm

In the previous sections we examined the message passing or belief propagation algorithm and the sum-product algorithm. These algorithms can efficiently compute the marginal distributions of variables in a tree graph. However, as you have seen, if there are cycles of 4 or more nodes, these algorithms fail. The **junction tree algorithm** transforms cyclic graphs into clique trees.

In this lesson we will only discuss the basic ideas of the algorithm. The details are rather lengthy and tedious, although clearly important if you need to implement the method. For more details please see Chapter 6 of Barber or Chapter 10 of Koller and Friedman. 

In the remainder of this section we will briefly explore the steps of the junction tree algorithm. Conceptually the approach taken in the junction tree algorithm is simple. These general steps are:
1. Moralize the graph, following the process we have already applied.
2. **Triangulate** the graph to transform multiply connected graphs to trees. All cycles of length 4 or more are triangulated into 3 cycles. Triangulation solutions are not unique, and can have a significant effect on the computational complexity of the algorithm. 
3. Build a **clique tree** from the transformed graph. The clique tree is composed of clique variables and factors. 
4. Propagate the potentials by local message passing, either from factors to variables or from variables to factors. The messages are normalized or constrained in such a way that the distribution represented by the triangulated graph is the same as the moralized graph. 

### Moralization

We have already discussed the principles of moralization in a previous lesson. The same approach is used for the junction tree algorithm. 

It is worth noting that the moralization process marries pairs of nodes that form a V-structure. This is not the same as triangulation of cycles of 4 or more nodes. 

### Triangularization 

With the graph moralized, there can still be cycles with four or more variables. The question at this point is whether the cliques created in these cycles lead to consistent results. To pursue this question consider the graph and the resulting clique tree shown in the figure below.

<img src="img/FourCycle.JPG" alt="Drawing" style="width:350px; height:150px"/>
<center> **A four cycle graph with the corresponding clique tree** </center>

Notice that there is no way to ensure that the probability associated with variable 3 is consistent between the two branches of the tree. There is no guarantee of **global consistency** with cycles of four or more variables. 

To solve the above problem we need to perform a procedure known as **triangularization**. The triangularization procedure adds an edge to the four cycle as shown in the figure below.  

<img src="img/Triangle1.JPG" alt="Drawing" style="width:150px; height:300px"/>
<center> **A triangulated four cycle graph** </center>

The triangularized graph leads to a globally consistent clique graph. Notice that there are two possible triangularizations, which lead to two different clique trees. While the clique trees are globally consistent, they are not unique. 
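
One common way to triangulate a graph is to simulate variable elimination and record the fill-in edges that are added along the way. The sketch below applies this idea to a four cycle. The node labels and the two elimination orders are assumptions made for illustration, and the function name is hypothetical.

```python
def triangulate(edges, order):
    """Return fill-in edges and maximal elimination cliques for an order."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    fill_in, cliques = [], []
    for node in order:
        nbrs = sorted(adjacency[node])
        cliques.append(frozenset([node] + nbrs))
        # Connect every pair of remaining neighbors of the eliminated node.
        for i, a in enumerate(nbrs):
            for b in nbrs[i + 1:]:
                if b not in adjacency[a]:
                    adjacency[a].add(b)
                    adjacency[b].add(a)
                    fill_in.append((a, b))
        # Remove the eliminated node from the graph.
        for a in nbrs:
            adjacency[a].discard(node)
        del adjacency[node]
    # Keep only the cliques that are not contained in a larger one.
    maximal = [c for c in cliques if not any(c < other for other in cliques)]
    return fill_in, maximal


# Assumed labeling of the four cycle: 1 - 2 - 4 - 3 - 1.
four_cycle = [(1, 2), (2, 4), (4, 3), (3, 1)]
fill_a, cliques_a = triangulate(four_cycle, order=[1, 2, 3, 4])
fill_b, cliques_b = triangulate(four_cycle, order=[2, 1, 3, 4])
print(fill_a, cliques_a)
print(fill_b, cliques_b)
```

The two elimination orders add different chords and so produce different pairs of cliques, which illustrates why the triangulation is not unique.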

### Clique tree

The triangulated graph is now transformed to a clique tree. A clique tree is comprised of the cliques and separators. The separators are comprised of the common nodes between the cliques. An example of a simple four cycle graph and corresponding clique trees is shown in the figure below. Notice that even in this simple case, the result is not unique. 

<img src="img/CliqueTree.JPG" alt="Drawing" style="width:350px; height:300px"/>
<center> **A triangulated four cycle graph with the corresponding clique trees** </center>

The factor in the square box is known as the **separator** and is the intersection of the two cliques. For the upper clique tree in the above figure, the distribution can be expressed as:

$$p(\chi) = \frac{\phi(1,2,3)\ \phi(2,3,4)}{\phi(2,3)}$$

Notice that the separator potential, in the denominator, acts as the normalization.
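
The clique and separator relationship can be checked numerically. The sketch below assumes that the potentials $\phi$ in the formula are the clique and separator marginals, which is the state reached once messages have been propagated. The two potential functions `f` and `g` are hypothetical and serve only to build a joint distribution that factorizes over the cliques $\{1,2,3\}$ and $\{2,3,4\}$.

```python
import itertools


# Hypothetical positive potentials on the two cliques.
def f(x1, x2, x3):
    return 1.0 + x1 + 2 * x2 * x3


def g(x2, x3, x4):
    return 1.0 + 3 * x4 * x2 + x3


states = list(itertools.product([0, 1], repeat=4))
weights = {s: f(s[0], s[1], s[2]) * g(s[1], s[2], s[3]) for s in states}
total = sum(weights.values())
joint = {s: w / total for s, w in weights.items()}


def marginal(keep):
    """Marginal of the joint over the variable positions listed in keep."""
    table = {}
    for s, p in joint.items():
        key = tuple(s[i] for i in keep)
        table[key] = table.get(key, 0.0) + p
    return table


p_123 = marginal((0, 1, 2))     # clique {1, 2, 3}
p_234 = marginal((1, 2, 3))     # clique {2, 3, 4}
p_23 = marginal((1, 2))         # separator {2, 3}

# Largest difference between the joint and the clique/separator formula.
max_error = max(
    abs(joint[s] - p_123[s[:3]] * p_234[s[1:]] / p_23[s[1:3]])
    for s in states
)
print(max_error)
```

The error is zero up to floating point rounding, because variables 1 and 4 are conditionally independent given the separator variables 2 and 3.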

The computational complexity of the junction tree algorithm is dependent on the maximum width of the clique tree. Ideally, we want to find the clique tree with the minimum width. However, finding a minimal width clique tree is an NP-hard search problem. In practice, one of several heuristics is used to produce a good, but not necessarily optimal, solution.  

### Message passing

As has already been mentioned, the messages passed on the clique tree are constrained so the distribution represented is the same as the moralized graph. The form of the distribution of the moralized graph should be familiar by now:

$$p(x) = \frac{1}{Z} \prod_{c \in \mathcal{C}(G)} \phi(x_c)$$

This distribution must be the same as the one represented by the junction tree, $\mathcal{T}$. This leads to the following relationship which constrains the values of the messages.

$$\prod_{c \in \mathcal{C}(G)} \phi(x_c) = \prod_{c \in \mathcal{T}(G)} \phi(x_c)$$

As with the previously discussed message passing algorithms, the messages are first passed from the leaves toward the root or query node (collect phase) and then back again (distribute phase).


### A computational example

With the bit of theory in mind let's work through a computational example for the junction tree algorithm. We will expand the graph for the student problem to include the CPD of her obtaining the job she is interested in. The updated graph is shown below.    

<img src="img/StudentGraph.JPG" alt="Drawing" style="width:400px; height:300px"/>
<center> **Belief network with Job node added** </center>

The first step in applying the junction tree algorithm is to moralize the belief network to produce the undirected Markov network. The moralized graph is shown in the figure below. 

<img src="img/Moralized.JPG" alt="Drawing" style="width:400px; height:300px"/>
<center> **Moralized graph** </center>
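
The moralization step is simple enough to sketch in plain Python. The edge list below is the student DAG with the Job node, taking `L` and `S` as the parents of `J` as in the code used later in this section. The function name `moralize` is a hypothetical helper, not part of `pgmpy`.

```python
def moralize(dag_edges):
    """Return the undirected edge set of the moral graph of a DAG."""
    parents = {}
    for parent, child in dag_edges:
        parents.setdefault(child, []).append(parent)
    undirected = {frozenset(edge) for edge in dag_edges}   # drop directions
    for child, pars in parents.items():
        # Marry every pair of parents that share a child.
        for i, a in enumerate(pars):
            for b in pars[i + 1:]:
                undirected.add(frozenset((a, b)))
    return undirected


student_dag = [('D', 'G'), ('I', 'G'), ('G', 'L'), ('I', 'S'),
               ('L', 'J'), ('S', 'J')]
moral_edges = moralize(student_dag)
added = moral_edges - {frozenset(e) for e in student_dag}
print(sorted(tuple(sorted(e)) for e in added))
```

For this DAG the only edges added are between `D` and `I`, the parents of `G`, and between `L` and `S`, the parents of `J`.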

Notice that the above graph has two cycles of 4 nodes each, $\{ I, G, J, S \}$ and $\{ I, G, L, S \}$. This graph must be triangulated. There are several possibilities for triangulation, one of which is illustrated in the figure below.  

<img src="img/Triangulated.JPG" alt="Drawing" style="width:400px; height:300px"/>
<center> **Triangulated graph** </center>

The triangulated graph is now ready for inference using clique elimination.

Let's see how this example works in code. As a first step, you need to import the `JunctionTree` class.

In [11]:
from pgmpy.models import JunctionTree

We need to add the Job or $J$ node to the directed graph. Execute the code in the cell below to add this CPD to the model. 

In [12]:
CDP_J = TabularCPD(variable='J', variable_card=2, 
                   values=[[0.9, 0.7, 0.6, 0.3],
                           [0.1, 0.3, 0.4, 0.7]],
                  evidence=['L', 'S'],
                  evidence_card=[2, 2])
print(CDP_J)
student_model.add_edge('L','J')
student_model.add_edge('S','J')
student_model.add_cpds(CDP_J)
student_model.check_model()

╒═════╤═════╤═════╤═════╤═════╕
│ L   │ L_0 │ L_0 │ L_1 │ L_1 │
├─────┼─────┼─────┼─────┼─────┤
│ S   │ S_0 │ S_1 │ S_0 │ S_1 │
├─────┼─────┼─────┼─────┼─────┤
│ J_0 │ 0.9 │ 0.7 │ 0.6 │ 0.3 │
├─────┼─────┼─────┼─────┼─────┤
│ J_1 │ 0.1 │ 0.3 │ 0.4 │ 0.7 │
╘═════╧═════╧═════╧═════╧═════╛


True

The DAG needs to be transformed into a junction tree using the `to_junction_tree()` method. The code in the cell below transforms the DAG to the junction tree.  

In [13]:
student_jt = student_model.to_junction_tree()



Next, you can apply the `BeliefPropagation` class to the junction tree object as shown below.

In [14]:
student_belief_jt = BeliefPropagation(student_jt)
student_belief_jt.factors

defaultdict(list,
            {'D': [<DiscreteFactor representing phi(L:2, I:2, D:2, G:3) at 0x287c25d7ef0>],
             'G': [<DiscreteFactor representing phi(L:2, I:2, D:2, G:3) at 0x287c25d7ef0>],
             'I': [<DiscreteFactor representing phi(L:2, I:2, S:2) at 0x287bd67f940>,
              <DiscreteFactor representing phi(L:2, I:2, D:2, G:3) at 0x287c25d7ef0>],
             'J': [<DiscreteFactor representing phi(L:2, J:2, S:2) at 0x287c25d7e48>],
             'L': [<DiscreteFactor representing phi(L:2, J:2, S:2) at 0x287c25d7e48>,
              <DiscreteFactor representing phi(L:2, I:2, S:2) at 0x287bd67f940>,
              <DiscreteFactor representing phi(L:2, I:2, D:2, G:3) at 0x287c25d7ef0>],
             'S': [<DiscreteFactor representing phi(L:2, J:2, S:2) at 0x287c25d7e48>,
              <DiscreteFactor representing phi(L:2, I:2, S:2) at 0x287bd67f940>]})

Examine the factors printed above. Notice that these are maximal cliques. You can see that each of the nodes is represented by the factor associated with a particular clique.  

With the belief propagated, queries can be performed given evidence. Execute the example in the cell below to find the posterior probabilities of the student getting the job, given that she is intelligent and the machine learning course was difficult. 

In [15]:
print(student_belief_jt.query(variables=['J'], evidence={'I':1, 'D':1})['J'])

╒═════╤══════════╕
│ J   │   phi(J) │
╞═════╪══════════╡
│ J_0 │   0.4998 │
├─────┼──────────┤
│ J_1 │   0.5002 │
╘═════╧══════════╛
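
The query result above can be checked by hand. With $I=1$ and $D=1$ observed, the letter $L$ and the score $S$ are independent, so the posterior of $J$ is a sum over the four combinations of $L$ and $S$. The sketch below copies the CPD values from the model definition and assumes the column for $I=1$, $D=1$ is the last column of the grade CPD.

```python
p_g = [0.5, 0.3, 0.2]                  # P(G | I=1, D=1), assumed last column
p_l_given_g = [[0.1, 0.4, 0.99],
               [0.9, 0.6, 0.01]]        # P(L | G)
p_s = [0.2, 0.8]                       # P(S | I=1)
# P(J | L, S), keyed by (L, S); each list is [P(J_0), P(J_1)].
p_j_given_ls = {(0, 0): [0.9, 0.1], (0, 1): [0.7, 0.3],
                (1, 0): [0.6, 0.4], (1, 1): [0.3, 0.7]}

# Eliminate G to obtain P(L | I=1, D=1).
p_l = [sum(p_l_given_g[l][g] * p_g[g] for g in range(3)) for l in range(2)]
# Eliminate L and S to obtain P(J | I=1, D=1).
p_j = [sum(p_l[l] * p_s[s] * p_j_given_ls[(l, s)][j]
           for l in range(2) for s in range(2))
       for j in range(2)]
print(p_j)
```

The values agree with the printed table to four decimal places.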




Finally, you can see the beliefs on the maximal cliques of the junction tree by executing the code in the cell below. 

In [16]:
for key in student_belief_jt.clique_beliefs.keys():
    print(student_belief_jt.clique_beliefs[key])

╒═════╤═════╤═════╤══════════════╕
│ L   │ J   │ S   │   phi(L,J,S) │
╞═════╪═════╪═════╪══════════════╡
│ L_0 │ J_0 │ S_0 │       0.4045 │
├─────┼─────┼─────┼──────────────┤
│ L_0 │ J_0 │ S_1 │       0.0397 │
├─────┼─────┼─────┼──────────────┤
│ L_0 │ J_1 │ S_0 │       0.0449 │
├─────┼─────┼─────┼──────────────┤
│ L_0 │ J_1 │ S_1 │       0.0170 │
├─────┼─────┼─────┼──────────────┤
│ L_1 │ J_0 │ S_0 │       0.2104 │
├─────┼─────┼─────┼──────────────┤
│ L_1 │ J_0 │ S_1 │       0.0430 │
├─────┼─────┼─────┼──────────────┤
│ L_1 │ J_1 │ S_0 │       0.1402 │
├─────┼─────┼─────┼──────────────┤
│ L_1 │ J_1 │ S_1 │       0.1003 │
╘═════╧═════╧═════╧══════════════╛
╒═════╤═════╤═════╤══════════════╕
│ L   │ I   │ S   │   phi(L,I,S) │
╞═════╪═════╪═════╪══════════════╡
│ L_0 │ I_0 │ S_0 │       0.4410 │
├─────┼─────┼─────┼──────────────┤
│ L_0 │ I_0 │ S_1 │       0.0232 │
├─────┼─────┼─────┼──────────────┤
│ L_0 │ I_1 │ S_0 │       0.0084 │
├─────┼─────┼─────┼──────────────┤
│ L_0 │ I_1 │ S_1 │ 

#### Copyright 2018, Stephen F Elston. All rights reserved.