# BMI/CS 576 - HW6
The objectives of this homework are to better understand:

* the statistical dependencies represented by a Bayesian network
* alternative representations of conditional probability distributions (CPDs)
* how model evidence works as a score for Bayesian networks
* the Sparse Candidate algorithm

## HW policies
Before starting this homework, please read over the [homework policies](https://canvas.wisc.edu/courses/167969/pages/hw-policies) for this course.  In particular, note that homeworks are to be completed *individually*.

You are welcome to use any code from the weekly notebooks in your solutions to the HW.

## PROBLEM 1 (30 POINTS)

Consider the Bayesian network below.

![simple_network](simple_network.png)

**(a)** Give a table specifying the joint probability distribution, $P(A, B, C)$, represented by the Bayesian network.

**(b)** Given your table from (a), compute $P(A = true\ |\ C = true)$.

**(c)** Given your table from (a), compute $P(A = true\ |\ B = true)$.

**(d)** Given your table from (a), compute $P(A = true\ |\ B = true, C = true)$.

**(e)** Given your table from (a), is $A$ independent of $B$? Justify your answer.
  
**(f)** Given your table from (a), is $A$ independent of $B$ given $C$? Justify your answer.
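
As a reminder of the mechanics (not a solution to this problem), the sketch below builds a joint table from a Bayesian network and answers conditional queries by summing table entries. The chain structure $U \rightarrow V \rightarrow W$, the variable names, and all probabilities are hypothetical and are not taken from the figure above.

```python
from itertools import product

# Hypothetical CPDs for a chain U -> V -> W (NOT the network in the figure).
p_u = {True: 0.3, False: 0.7}          # P(U)
p_v_given_u = {True: 0.9, False: 0.2}  # P(V = true | U)
p_w_given_v = {True: 0.6, False: 0.1}  # P(W = true | V)

def bernoulli(p_true, value):
    """Return P(value) for a binary variable with P(true) = p_true."""
    return p_true if value else 1.0 - p_true

# Joint table from the factorization P(U, V, W) = P(U) P(V | U) P(W | V).
joint = {
    (u, v, w): p_u[u] * bernoulli(p_v_given_u[u], v) * bernoulli(p_w_given_v[v], w)
    for u, v, w in product([True, False], repeat=3)
}

NAMES = ("u", "v", "w")

def prob(joint, **fixed):
    """Sum the joint entries consistent with the given assignments."""
    return sum(
        p for values, p in joint.items()
        if all(values[NAMES.index(name)] == val for name, val in fixed.items())
    )

def conditional(joint, query, evidence):
    """P(query | evidence); both arguments are dicts of variable name -> value."""
    return prob(joint, **query, **evidence) / prob(joint, **evidence)

# Example query: P(U = true | V = true) in the hypothetical chain.
print(conditional(joint, {"u": True}, {"v": True}))
```

Two variables are independent given some evidence exactly when conditioning on one of them leaves the other's conditional distribution unchanged, so comparing two `conditional` calls is one way to check an independence statement numerically.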



###
### solution to problem 1
###


## PROBLEM 2 (25 POINTS)
As shown in the slide "Representing CPDs for Discrete Variables" (slide 8) of the lecture "Networks - Introduction to Bayesian Networks", some conditional probability distributions (CPDs) can also be represented with a tree.

**(a)** Give the CPD table for the distribution $P(D\ |\ A,B,C)$ represented by the tree below.
![](decision_tree.png)

**(b)** Give the most compact tree (i.e., the one with the fewest nodes) representation of the distribution $P(D\ |\ A,B,C)$ specified by the CPD table below.

![](decision_tree_cpd.png)

**(c)** Suppose that you know that the CPD $P(D\ |\ A,B,C)$ can be represented by the tree structure of part (a), but that you don't know the parameters at the leaves of the tree.  Now suppose you are given some training data with which to estimate the CPD.  What is the major advantage of the tree representation over the CPD table representation in estimating the parameters of the CPD?
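
To illustrate how the two representations relate, the sketch below expands a tree-structured CPD into a full table with one row per parent configuration. The tree structure and leaf probabilities in `tree_cpd` are hypothetical and do not correspond to either figure in this problem.

```python
from itertools import product

def tree_cpd(a, b, c):
    """Hypothetical tree CPD: return P(D = true | a, b, c)."""
    if not a:
        return 0.9  # leaf reached without testing b or c
    if b:
        return 0.5  # leaf reached without testing c
    return 0.2 if c else 0.7

# Expand the tree into a table with one entry per configuration of (A, B, C).
cpd_table = {
    (a, b, c): tree_cpd(a, b, c)
    for a, b, c in product([True, False], repeat=3)
}

for (a, b, c), p_true in cpd_table.items():
    print(f"A={a!s:5} B={b!s:5} C={c!s:5}  "
          f"P(D=true)={p_true:.1f}  P(D=false)={1 - p_true:.1f}")
```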

###
### solution to problem 2
###


## PROBLEM 3 (30 POINTS)

Consider two possible Bayesian networks for two binary random variables, $X_1$ and $X_2$, shown below.

![](two_var_networks.png)

**(a)** Give the likelihood function, $P(D|G_0, \theta)$, for network $G_0$ in terms of the count variables shown in the table above.

**(b)** Give the likelihood function, $P(D|G_1, \theta)$, for network $G_1$ in terms of the count variables shown in the table above.

**(c)** Suppose that we estimate maximum likelihood values, $\hat{\theta}_{MLE}$, for the parameters of each of the two networks given a data set, $D$.  Show that $P(D|G_1, \hat{\theta}_{MLE}) \geq P(D|G_0, \hat{\theta}_{MLE})$ for any data set, $D$, and thus that the likelihood is not a good way to score networks. *Hint: consider what happens when $\theta_2 = \theta_{20} = \theta_{21}$.*

**(d)** Derive the model evidence, $P(D|G_0)$ for the network $G_0$ in terms of the count variables shown in the table above.

**(e)** Derive the model evidence, $P(D|G_1)$ for the network $G_1$ in terms of the count variables shown in the table above.

**(f)** Consider the case in which $n = 20$ and $n_{0-} = n_{1-} = n_{-0} = n_{-1} = 10$ (i.e., each row and each column of the data table sums to half of the total observations).  Compute the difference in the log model evidence between the two models, $\log(P(D|G_1)) - \log(P(D|G_0))$, over all possible values of $n_{00}$ (note that specifying $n_{00}$ specifies all other counts).  These values indicate for which data sets we would prefer $G_1$ over $G_0$, and vice versa.  Show your results as a plot of $\log(P(D|G_1)) - \log(P(D|G_0))$ vs. $n_{00}$.
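
When evaluating model evidence numerically, it usually helps to work in log space so that large counts do not overflow. The sketch below computes the log evidence for a single binary parameter, assuming a Beta($\alpha$, $\beta$) prior on that parameter; the prior, the default values $\alpha = \beta = 1$, and the function names are assumptions made for illustration and may differ from the prior specified in lecture.

```python
import math

def log_beta(a, b):
    """Log of the Beta function B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_evidence_bernoulli(n_true, n_false, alpha=1.0, beta=1.0):
    """Log of the integral of theta^n_true (1 - theta)^n_false
    under an assumed Beta(alpha, beta) prior on theta."""
    return log_beta(alpha + n_true, beta + n_false) - log_beta(alpha, beta)

# Example with made-up counts: 3 observations of "true" and 1 of "false".
print(log_evidence_bernoulli(3, 1))
```

Because the parameters of a Bayesian network are typically assumed to be independent a priori, the log evidence of a whole network can be assembled by adding terms of this form, one per parameter.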

###
### solution to problem 3
###


```python
import math
import matplotlib.pyplot as plt
```

```python
###
### YOUR CODE HERE
###
```


## PROBLEM 4 (15 POINTS) 


Suppose we wish to reconstruct the gene regulatory network for three genes, $X$, $Y$, and $Z$, using the Bayesian network approach and the “sparse candidate” algorithm. We are given data from 100 independent experiments in which the expression levels of the three genes are measured. For simplicity, we model each gene as being either “on” (T) or “off” (F). Below is a table summarizing the number of times (count) each configuration of gene expression status was observed in these experiments.


| X | Y | Z | count |
|---|---|---|-------|
| T | T | T |  36   |
| T | T | F |   4   |
| T | F | T |   2   |
| T | F | F |   8   |
| F | T | T |   9   |
| F | T | F |   1   |
| F | F | T |   8   |
| F | F | F |  32   |


**(a)** Suppose we wish to select a single candidate parent for $Z$. In the first round of the “sparse candidate” algorithm, we compute the mutual information between $Z$ and each of the other random variables. Compute the mutual information between $Z$ and $X$, $I(X,Z)$, based on the frequencies observed in the data.

**(b)** Compute the mutual information between $Z$ and $Y$, $I(Y,Z)$, based on the frequencies observed in the data.

**(c)** Based on your answers to (a) and (b), which gene would be selected as the candidate parent for Z? Briefly explain your answer.
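
If you would like to check your hand calculations, the sketch below computes the empirical mutual information between two discrete variables from a table of pairwise counts. The counts in `hypothetical_counts` are made up and are not derived from the table above, and the base-2 logarithm is an assumption; use whichever base the lecture uses.

```python
import math

def mutual_information(counts, base=2.0):
    """Empirical mutual information from a dict {(a, b): count}."""
    n = sum(counts.values())

    # Marginal frequencies of each variable.
    p_a, p_b = {}, {}
    for (a, b), c in counts.items():
        p_a[a] = p_a.get(a, 0.0) + c / n
        p_b[b] = p_b.get(b, 0.0) + c / n

    # Sum p(a, b) * log(p(a, b) / (p(a) p(b))), treating 0 log 0 as 0.
    mi = 0.0
    for (a, b), c in counts.items():
        if c > 0:
            p_ab = c / n
            mi += p_ab * math.log(p_ab / (p_a[a] * p_b[b]), base)
    return mi

# Made-up pairwise counts for two binary variables.
hypothetical_counts = {("T", "T"): 40, ("T", "F"): 10,
                       ("F", "T"): 10, ("F", "F"): 40}
print(mutual_information(hypothetical_counts))
```

Note that the three-variable table must first be collapsed to pairwise counts by summing over the variable that is not part of the pair.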

###
### solution to problem 4
###
