In [None]:
'''
 * Copyright (c) 2008 Radhamadhab Dalai
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 * THE SOFTWARE.
'''

![image-3.png](attachment:image-3.png)

Fig.15 The first of three examples of graphs over three variables a, b, and c used to discuss conditional independence properties of directed graphical models.

![image-2.png](attachment:image-2.png)

Fig.16 As in Fig.15 but where we have conditioned on the value of variable c.

![image.png](attachment:image.png)

Fig.17 The second of our three examples of 3-node graphs used to motivate the conditional independence framework for directed graphical models.


##  Conditional Independence

Conditional independence is a critical concept in probability distributions involving multiple variables (Dawid, 1980). Let's consider three variables $a$, $b$, and $c$. Suppose the conditional distribution of $a$ given $b$ and $c$ does not depend on $b$:

$$
p(a | b, c) = p(a | c)
\tag{8.20}
$$

This means that $a$ is **conditionally independent** of $b$ given $c$.

## Joint Distribution Factorization

The joint distribution of $a$ and $b$ conditioned on $c$ can be expressed as:

$$
p(a, b | c) = p(a | b, c) p(b | c) = p(a | c) p(b | c)
\tag{8.21}
$$

Here, the product rule of probability and Equation $8.20$ are used. This shows that, conditioned on $c$, the joint distribution of $a$ and $b$ factorizes into the product of their marginal distributions (conditioned on $c$).

Thus, $a$ and $b$ are **statistically independent given $c$**.

### Notation for Conditional Independence

To simplify notation, we denote conditional independence as:

$$
a \perp\!\!\!\perp b \, | \, c
\tag{8.22}
$$

This is equivalent to Equation $8.20$.

### Importance of Conditional Independence

Conditional independence simplifies the structure and computations of probabilistic models, enabling efficient inference and learning. Graphical models allow us to directly identify conditional independence properties without analytical manipulations.

---

## Three Example Graphs

### Example 1: Graph with Tail-to-Tail Connections

Consider the graph of Fig.15, in which $c$ is a common parent of $a$ and $b$ (the arrows leave $c$ tail-first in both directions):




#### Joint Distribution

The joint distribution is written as:

$$
p(a, b, c) = p(a | c) p(b | c) p(c)
\tag{8.23}
$$

#### Marginal Independence ($\emptyset$)

If none of the variables are observed, marginalizing over $c$ yields:

$$
p(a, b) = \sum_c p(a | c) p(b | c) p(c)
\tag{8.24}
$$

In general, this does not factorize into $p(a) p(b)$, so:

$$
a \not\!\perp\!\!\!\perp b \, | \, \emptyset
\tag{8.25}
$$

#### Conditional Independence ($c$)

Now suppose we condition on $c$. The conditional distribution becomes:

$$
p(a, b | c) = \frac{p(a, b, c)}{p(c)} = p(a | c) p(b | c)
$$

This implies:

$$
a \perp\!\!\!\perp b \, | \, c
$$

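Graphically, node $c$ is tail-to-tail with respect to the path from $a$ to $b$. Left unobserved, it leaves the path open and makes $a$ and $b$ dependent; once observed (Fig.16), it blocks the path and makes them conditionally independent.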
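The two results for the tail-to-tail graph can be checked numerically. The following is a minimal sketch, assuming binary variables and arbitrary illustrative probability tables (the numbers are not from the text); it builds the joint distribution of Equation $8.23$ and tests both factorizations.

```python
import numpy as np

# Illustrative tables (assumed values): p(c), p(a|c) indexed [a, c], p(b|c) indexed [b, c]
p_c = np.array([0.3, 0.7])
p_a_given_c = np.array([[0.9, 0.2],
                        [0.1, 0.8]])
p_b_given_c = np.array([[0.6, 0.1],
                        [0.4, 0.9]])

# Equation (8.23): p(a, b, c) = p(a|c) p(b|c) p(c), indexed [a, b, c]
joint = np.einsum("ac,bc,c->abc", p_a_given_c, p_b_given_c, p_c)

# Nothing observed: compare p(a, b) from (8.24) with p(a) p(b)
p_ab = joint.sum(axis=2)
p_a = p_ab.sum(axis=1)
p_b = p_ab.sum(axis=0)
marginally_independent = np.allclose(p_ab, np.outer(p_a, p_b))

# c observed: compare p(a, b | c) with p(a|c) p(b|c)
p_ab_given_c = joint / joint.sum(axis=(0, 1), keepdims=True)
product = np.einsum("ac,bc->abc", p_a_given_c, p_b_given_c)
conditionally_independent = np.allclose(p_ab_given_c, product)

print("a independent of b (nothing observed):", marginally_independent)
print("a independent of b given c:", conditionally_independent)
```

With these tables the first check fails and the second succeeds, matching $8.25$ and the conditional independence derived above.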

---

### Example 2: Graph with Head-to-Tail Connections

Now consider the graph of Fig.17, a chain $a \to c \to b$:



#### Joint Distribution

The joint distribution is given by:

$$
p(a, b, c) = p(a) p(c | a) p(b | c)
\tag{8.26}
$$

#### Marginal Independence ($\emptyset$)

If none of the variables are observed, marginalizing over $c$ gives:

$$
p(a, b) = p(a) \sum_c p(c | a) p(b | c)
$$

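In general, this does not factorize into $p(a) p(b)$, so:

$$
a \not\!\perp\!\!\!\perp b \, | \, \emptyset
$$

#### Conditional Independence ($c$)

Now suppose we condition on $c$. Using Equation $8.26$ and Bayes' theorem in the form $p(a) p(c | a) = p(a | c) p(c)$:

$$
p(a, b | c) = \frac{p(a, b, c)}{p(c)} = \frac{p(a) p(c | a) p(b | c)}{p(c)} = p(a | c) p(b | c)
\tag{8.27}
$$

so that $a \perp\!\!\!\perp b \, | \, c$. Node $c$ is head-to-tail with respect to the path from $a$ to $b$, and observing it blocks the path.

---

The first two examples show how conditional independence properties can be derived and read off a directed graph; the third example behaves differently. The concept of **d-separation** generalizes this framework to arbitrary graphs.

---

### Example 3: Graph with Head-to-Head Connections

Finally, consider the graph in which $a$ and $b$ are both parents of $c$.
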
#### Joint Distribution

The joint distribution is:

$$
p(a, b, c) = p(a)p(b)p(c|a, b)
\tag{8.28}
$$

#### No Variables Observed ($\emptyset$)

If none of the variables are observed, marginalizing over $c$ gives:

$$
p(a, b) = p(a)p(b)
$$

Thus:

$$
a \perp\!\!\!\perp b \, | \, \emptyset
\tag{8.29}
$$

#### Conditioning on $c$

Conditioning on $c$ gives:

$$
p(a, b|c) = \frac{p(a, b, c)}{p(c)} = \frac{p(a)p(b)p(c|a, b)}{p(c)}
$$

In general, this does not factorize into $p(a|c)p(b|c)$:

$$
a \not\!\perp\!\!\!\perp b \, | \, c
$$

Here, $c$ is a head-to-head node. Left unobserved it blocks the path between $a$ and $b$, but observing it unblocks the path and renders $a$ and $b$ dependent.
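The opposite behaviour of the head-to-head node can be confirmed numerically. This is a small sketch under assumed, purely illustrative probability tables for binary $a$, $b$, and $c$; it builds the joint of Equation $8.28$ and tests independence before and after conditioning on $c$.

```python
import numpy as np

# Illustrative tables (assumed values)
p_a = np.array([0.7, 0.3])
p_b = np.array([0.4, 0.6])
# p(c | a, b), indexed [c, a, b]; every [:, a, b] slice sums to 1
p_c_given_ab = np.array([[[0.9, 0.5], [0.4, 0.1]],
                         [[0.1, 0.5], [0.6, 0.9]]])

# Equation (8.28): p(a, b, c) = p(a) p(b) p(c|a, b), indexed [a, b, c]
joint = np.einsum("a,b,cab->abc", p_a, p_b, p_c_given_ab)

# Nothing observed: summing out c leaves p(a) p(b)
p_ab = joint.sum(axis=2)
marginally_independent = np.allclose(p_ab, np.outer(p_a, p_b))

# c observed: compare p(a, b | c) with p(a|c) p(b|c)
p_ab_given_c = joint / joint.sum(axis=(0, 1), keepdims=True)
p_a_given_c = p_ab_given_c.sum(axis=1)  # indexed [a, c]
p_b_given_c = p_ab_given_c.sum(axis=0)  # indexed [b, c]
product = np.einsum("ac,bc->abc", p_a_given_c, p_b_given_c)
conditionally_independent = np.allclose(p_ab_given_c, product)

print("a independent of b (nothing observed):", marginally_independent)
print("a independent of b given c:", conditionally_independent)
```

Here the marginal check succeeds and the conditional check fails, the reverse of the tail-to-tail and head-to-tail cases.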

---

### Subtlety of Head-to-Head Nodes

A head-to-head node blocks a path if it is unobserved. However, if the node or any of its descendants is observed, the path becomes unblocked.

#### Terminology: Descendants

A node $y$ is a descendant of $x$ if there is a path from $x$ to $y$ following the direction of the arrows.
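The definition of a descendant translates directly into a graph search. The sketch below assumes the graph is given as a list of `(parent, child)` pairs; the function name `descendants` and the example edges are illustrative choices, not part of the original text.

```python
def descendants(edges, x):
    """Return every node reachable from x by following the arrows (x itself excluded)."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, set()).add(child)

    found, stack = set(), [x]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


# Hypothetical graph: a -> c <- b, c -> d -> e
edges = [("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
print(sorted(descendants(edges, "c")))
```

In this graph, observing either `d` or `e` would unblock the head-to-head node `c`, because both are among its descendants.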

#### Summary of Path Blocking:

1. **Tail-to-tail** or **head-to-tail** nodes:
   - Leave paths unblocked unless observed.

2. **Head-to-head** nodes:
   - Block paths if unobserved.
   - Unblock paths if the node or its descendants are observed.

---

## Example: Explaining Away

Consider a specific head-to-head graph, $B \to G \leftarrow F$, with three binary random variables:


- **$B$:** Battery state ($B = 1$: charged, $B = 0$: flat)
- **$F$:** Fuel tank state ($F = 1$: full, $F = 0$: empty)
- **$G$:** Fuel gauge reading ($G = 1$: full, $G = 0$: empty)

### Prior Probabilities:

- $p(B = 1) = 0.9$, $p(F = 1) = 0.9$
- Given $B$ and $F$, the probabilities for $G$ are:

$$
\begin{aligned}
p(G = 1|B = 1, F = 1) &= 0.8 \\
p(G = 1|B = 1, F = 0) &= 0.2 \\
p(G = 1|B = 0, F = 1) &= 0.2 \\
p(G = 1|B = 0, F = 0) &= 0.1
\end{aligned}
$$

### Observing $G = 0$ (Gauge Reads Empty)

Using Bayes' theorem:

#### Denominator of Bayes' Theorem:

$$
p(G = 0) = \sum_{B \in \{0, 1\}} \sum_{F \in \{0, 1\}} p(G = 0|B, F)p(B)p(F) = 0.315
\tag{8.30}
$$
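
As a check on the value in Equation $8.30$, the four terms of the double sum can be written out explicitly, using $p(G = 0|B, F) = 1 - p(G = 1|B, F)$ together with the priors given above:

$$
\begin{aligned}
p(G = 0) &= \underbrace{0.2 \times 0.9 \times 0.9}_{B = 1,\, F = 1} + \underbrace{0.8 \times 0.9 \times 0.1}_{B = 1,\, F = 0} + \underbrace{0.8 \times 0.1 \times 0.9}_{B = 0,\, F = 1} + \underbrace{0.9 \times 0.1 \times 0.1}_{B = 0,\, F = 0} \\
&= 0.162 + 0.072 + 0.072 + 0.009 = 0.315
\end{aligned}
$$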

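#### Posterior Probability:

The likelihood $p(G = 0|F = 0)$ is obtained by summing over the battery state:

$$
p(G = 0|F = 0) = \sum_{B \in \{0, 1\}} p(G = 0|B, F = 0)p(B) = 0.8 \times 0.9 + 0.9 \times 0.1 = 0.81
\tag{8.31}
$$

Bayes' theorem then gives:

$$
p(F = 0|G = 0) = \frac{p(G = 0|F = 0)p(F = 0)}{p(G = 0)} = \frac{0.81 \times 0.1}{0.315} \simeq 0.257
\tag{8.32}
$$

Thus, observing $G = 0$ makes $F = 0$ more likely ($p(F = 0|G = 0) \simeq 0.257 > p(F = 0) = 0.1$).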

---

### Observing $B = 0$ (Battery is Flat)

Now suppose we additionally observe that the battery is flat, $B = 0$. The posterior probability of $F = 0$ becomes:

$$
p(F = 0|G = 0, B = 0) = \frac{p(G = 0|B = 0, F = 0)p(F = 0)}{\sum_{F \in \{0, 1\}} p(G = 0|B = 0, F)p(F)} = 0.111
\tag{8.33}
$$

### Explaining Away:

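Observing $B = 0$ reduces the probability of $F = 0$ from $0.257$ to $0.111$, because the flat battery "explains away" the observation $G = 0$. The probability remains above the prior $p(F = 0) = 0.1$, because the empty gauge reading still provides some evidence for an empty tank.

### General Observations:

1. Observing $G = 0$ made $F = 0$ more likely.
2. Additionally observing $B = 0$ reduced the probability of $F = 0$, as $B = 0$ explains the gauge reading.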
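
The two posteriors can also be approximated by simulation, which gives an independent check on the Bayes' theorem arithmetic. The following is a rough sketch using ancestral sampling and simple filtering on the observed values; the sample size, the seed, and the helper names `sample` and `estimate_f0` are arbitrary choices made for illustration.

```python
import random

random.seed(0)

P_B1, P_F1 = 0.9, 0.9                                         # p(B=1), p(F=1)
P_G1 = {(1, 1): 0.8, (1, 0): 0.2, (0, 1): 0.2, (0, 0): 0.1}   # p(G=1 | B, F)


def sample():
    """Draw (B, F, G) by sampling the parents first, then the gauge reading."""
    b = int(random.random() < P_B1)
    f = int(random.random() < P_F1)
    g = int(random.random() < P_G1[(b, f)])
    return b, f, g


samples = [sample() for _ in range(200_000)]


def estimate_f0(condition):
    """Fraction of samples with F=0 among those satisfying condition(b, g)."""
    kept = [f for b, f, g in samples if condition(b, g)]
    return sum(1 for f in kept if f == 0) / len(kept)


est_g0 = estimate_f0(lambda b, g: g == 0)                 # approximates p(F=0 | G=0)
est_g0_b0 = estimate_f0(lambda b, g: g == 0 and b == 0)   # approximates p(F=0 | G=0, B=0)

print(f"p(F=0 | G=0)      ~ {est_g0:.3f}")
print(f"p(F=0 | G=0, B=0) ~ {est_g0_b0:.3f}")
```

The estimates should land close to the exact values $0.257$ and $0.111$, up to sampling noise.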


In [1]:
# Conditional Independence and Explaining Away Implementation

# Define the probabilities
p_B = 0.9  # P(B=1)
p_F = 0.9  # P(F=1)

# Conditional probabilities for G given B and F
p_G_given_B_F = {
    (1, 1): 0.8,  # P(G=1 | B=1, F=1)
    (1, 0): 0.2,  # P(G=1 | B=1, F=0)
    (0, 1): 0.2,  # P(G=1 | B=0, F=1)
    (0, 0): 0.1   # P(G=1 | B=0, F=0)
}

# Store both outcomes, indexed by the value of G: entry [g] is P(G=g | B, F)
p_G_given_B_F = {k: (1 - v, v) for k, v in p_G_given_B_F.items()}  # (P(G=0), P(G=1))

# Compute prior probability of G=0
def compute_prior_G():
    prob_G_0 = 0
    for B in [0, 1]:
        for F in [0, 1]:
            p_B_val = p_B if B == 1 else 1 - p_B
            p_F_val = p_F if F == 1 else 1 - p_F
            prob_G_0 += p_G_given_B_F[(B, F)][0] * p_B_val * p_F_val  # P(G=0 | B, F) P(B) P(F)
    return prob_G_0

p_G_0 = compute_prior_G()
print(f"P(G=0): {p_G_0:.3f}")

# Posterior probability of F=F_obs given G=G_obs
def posterior_F_given_G(F_obs, G_obs):
    # P(G=G_obs | F=F_obs) * P(F=F_obs) / P(G=G_obs), with B summed out
    prob_G_given_F = 0
    for B in [0, 1]:
        p_B_val = p_B if B == 1 else 1 - p_B
        prob_G_given_F += p_G_given_B_F[(B, F_obs)][G_obs] * p_B_val
    p_F_val = p_F if F_obs == 1 else 1 - p_F
    p_G_obs = p_G_0 if G_obs == 0 else 1 - p_G_0
    return (prob_G_given_F * p_F_val) / p_G_obs

p_F_0_given_G_0 = posterior_F_given_G(F_obs=0, G_obs=0)
print(f"P(F=0 | G=0): {p_F_0_given_G_0:.3f}")

# Posterior probability of F=F_obs given G=G_obs and B=B_obs
def posterior_F_given_G_and_B(F_obs, G_obs, B_obs):
    # P(G=G_obs | F=F_obs, B=B_obs) * P(F=F_obs) / sum_F P(G=G_obs | F, B=B_obs) * P(F)
    prob_G_given_B_F = p_G_given_B_F[(B_obs, F_obs)][G_obs]
    p_F_val = p_F if F_obs == 1 else 1 - p_F

    denominator = 0
    for F in [0, 1]:
        prob_G_given_B_F_all = p_G_given_B_F[(B_obs, F)][G_obs]
        p_F_all = p_F if F == 1 else 1 - p_F
        denominator += prob_G_given_B_F_all * p_F_all

    return (prob_G_given_B_F * p_F_val) / denominator

p_F_0_given_G_0_B_0 = posterior_F_given_G_and_B(F_obs=0, G_obs=0, B_obs=0)
print(f"P(F=0 | G=0, B=0): {p_F_0_given_G_0_B_0:.3f}")


P(G=0): 0.315
P(F=0 | G=0): 0.257
P(F=0 | G=0, B=0): 0.111


In [2]:
# Conditional Independence and Explaining Away Implementation (Pure Python)

# Define probabilities
p_B = 0.9  # P(B=1)
p_F = 0.9  # P(F=1)

# Conditional probabilities for G given B and F
p_G_given_B_F = {
    (1, 1): 0.8,  # P(G=1 | B=1, F=1)
    (1, 0): 0.2,  # P(G=1 | B=1, F=0)
    (0, 1): 0.2,  # P(G=1 | B=0, F=1)
    (0, 0): 0.1   # P(G=1 | B=0, F=0)
}

# Compute prior probability of G=0
def compute_prior_G():
    prob_G_0 = 0
    for B in [0, 1]:
        for F in [0, 1]:
            # P(B), P(F)
            p_B_val = p_B if B == 1 else 1 - p_B
            p_F_val = p_F if F == 1 else 1 - p_F
            # Add P(G=0 | B, F) * P(B) * P(F)
            prob_G_0 += (1 - p_G_given_B_F[(B, F)]) * p_B_val * p_F_val
    return prob_G_0

p_G_0 = compute_prior_G()
print(f"P(G=0): {p_G_0:.3f}")

# P(G=g | B, F), derived from the stored table of P(G=1 | B, F)
def p_G(g, B, F):
    p_G_1 = p_G_given_B_F[(B, F)]
    return p_G_1 if g == 1 else 1 - p_G_1

# Posterior probability of F=F_obs given G=G_obs
def posterior_F_given_G(F_obs, G_obs):
    # Numerator: P(G=G_obs | F=F_obs) * P(F=F_obs), with B summed out
    prob_G_given_F = 0
    for B in [0, 1]:
        p_B_val = p_B if B == 1 else 1 - p_B
        prob_G_given_F += p_G(G_obs, B, F_obs) * p_B_val

    p_F_val = p_F if F_obs == 1 else 1 - p_F
    numerator = prob_G_given_F * p_F_val

    # Denominator: P(G=G_obs)
    denominator = p_G_0 if G_obs == 0 else 1 - p_G_0

    return numerator / denominator

p_F_0_given_G_0 = posterior_F_given_G(F_obs=0, G_obs=0)
print(f"P(F=0 | G=0): {p_F_0_given_G_0:.3f}")

# Posterior probability of F=F_obs given G=G_obs and B=B_obs
def posterior_F_given_G_and_B(F_obs, G_obs, B_obs):
    # Numerator: P(G=G_obs | F=F_obs, B=B_obs) * P(F=F_obs)
    p_F_val = p_F if F_obs == 1 else 1 - p_F
    numerator = p_G(G_obs, B_obs, F_obs) * p_F_val

    # Denominator: Sum_F P(G=G_obs | F, B=B_obs) * P(F)
    denominator = 0
    for F in [0, 1]:
        p_F_all = p_F if F == 1 else 1 - p_F
        denominator += p_G(G_obs, B_obs, F) * p_F_all

    return numerator / denominator

p_F_0_given_G_0_B_0 = posterior_F_given_G_and_B(F_obs=0, G_obs=0, B_obs=0)
print(f"P(F=0 | G=0, B=0): {p_F_0_given_G_0_B_0:.3f}")


P(G=0): 0.315
P(F=0 | G=0): 0.257
P(F=0 | G=0, B=0): 0.111


## D-separation

We now give a general statement of the d-separation property (Pearl, 1988) for directed graphs. Consider a general directed graph in which $ A $, $ B $, and $ C $ are arbitrary non-intersecting sets of nodes (whose union may be smaller than the complete set of nodes in the graph). We wish to ascertain whether a particular conditional independence statement $ A \perp\!\!\!\perp B | C $ is implied by a given directed acyclic graph. 

To do so, we consider all possible paths from any node in $ A $ to any node in $ B $. Any such path is said to be blocked if it includes a node such that either:

1. The arrows on the path meet either head-to-tail or tail-to-tail at the node, and the node is in the set $ C $, or
2. The arrows meet head-to-head at the node, and neither the node, nor any of its descendants, is in the set $ C $.

If all paths are blocked, then $ A $ is said to be d-separated from $ B $ by $ C $, and the joint distribution over all of the variables in the graph will satisfy $ A \perp\!\!\!\perp B | C $.

The concept of d-separation is illustrated in Fig.22. In graph (a), the path from $a$ to $b$ is not blocked by node $f$, because $f$ is a tail-to-tail node for this path and is not observed. Nor is it blocked by node $e$: although $e$ is a head-to-head node, it has a descendant $c$ that is in the conditioning set. Thus the conditional independence statement $a \perp\!\!\!\perp b | c$ does not follow from this graph. In graph (b), the path from $a$ to $b$ is blocked by node $f$, because $f$ is a tail-to-tail node that is observed, and so the conditional independence property $a \perp\!\!\!\perp b | f$ is satisfied by any distribution that factorizes according to this graph. Note that this path is also blocked by node $e$, because $e$ is a head-to-head node and neither it nor its descendant $c$ is in the conditioning set.
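
The path-blocking rules can be turned into a direct, if inefficient, test by enumerating every path between two nodes. The sketch below is a plain-Python illustration, not a library implementation: it handles single nodes `a` and `b` rather than sets, and the edge list `fig22_edges` is an assumption reconstructed from the description of Fig.22 above ($a \to e \leftarrow f \to b$ and $e \to c$).

```python
def ancestors_of(edges, targets):
    """The nodes in `targets` together with all of their ancestors."""
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
    found, stack = set(targets), list(targets)
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def simple_paths(edges, start, goal):
    """All paths from start to goal that ignore arrow direction and repeat no node."""
    neighbours = {}
    for parent, child in edges:
        neighbours.setdefault(parent, set()).add(child)
        neighbours.setdefault(child, set()).add(parent)

    def extend(path):
        if path[-1] == goal:
            yield path
            return
        for nxt in neighbours.get(path[-1], ()):
            if nxt not in path:
                yield from extend(path + [nxt])

    yield from extend([start])


def path_is_blocked(edges, path, observed):
    edge_set = set(edges)
    # A head-to-head node fails to block exactly when it, or one of its
    # descendants, is observed, i.e. when it is an observed node or an
    # ancestor of an observed node.
    unblocking = ancestors_of(edges, observed)
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        head_to_head = (prev, node) in edge_set and (nxt, node) in edge_set
        if head_to_head:
            if node not in unblocking:
                return True          # rule 2
        elif node in observed:
            return True              # rule 1 (head-to-tail or tail-to-tail)
    return False


def d_separated(edges, a, b, observed):
    observed = set(observed)
    return all(path_is_blocked(edges, path, observed)
               for path in simple_paths(edges, a, b))


# Assumed structure of Fig.22: a -> e <- f -> b, with e -> c
fig22_edges = [("a", "e"), ("f", "e"), ("f", "b"), ("e", "c")]
print("graph (a), condition on c:", d_separated(fig22_edges, "a", "b", {"c"}))
print("graph (b), condition on f:", d_separated(fig22_edges, "a", "b", {"f"}))
```

Enumerating paths is exponential in the worst case, so this form is only suitable for small graphs; it is meant to mirror the two rules line by line.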

For the purposes of d-separation, parameters such as $\alpha$ and $\sigma^2$ in Fig.5, indicated by small filled circles, behave in the same way as observed nodes. However, there are no marginal distributions associated with such nodes. Parameter nodes therefore never have parents themselves, so every path through such a node is tail-to-tail at that node and hence blocked. As a result, they play no role in d-separation.

### Example: Gaussian Distribution

Another example of conditional independence and d-separation is provided by the concept of i.i.d. (independent identically distributed) data introduced in Section 1.2.4. Consider the problem of finding the posterior distribution for the mean $ \mu $ of a univariate Gaussian distribution. This can be represented by the directed graph shown in Fig.23 in which the joint distribution is defined by a prior $ p(\mu) $ together with a set of conditional distributions $ p(x_n | \mu) $ for $ n = 1, \ldots, N $.

In practice, we observe $D = \{x_1, \ldots, x_N\}$ and our goal is to infer $\mu$. Suppose, for a moment, that we condition on $\mu$ and consider the joint distribution of the observations. Using d-separation, we note that there is a unique path from any $x_i$ to any other $x_{j \neq i}$ and that this path is tail-to-tail with respect to the observed node $\mu$. Every such path is blocked and so the observations $D = \{x_1, \ldots, x_N\}$ are independent given $\mu$, so that

$$
p(D | \mu) = \prod_{n=1}^N p(x_n | \mu)
$$
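
Because the likelihood factorizes over the observations, its logarithm is a simple sum, and the posterior over $\mu$ follows in closed form when the prior is conjugate. The sketch below assumes a known noise variance `sigma2` and a Gaussian prior with mean `mu0` and variance `sigma0_2`; these assumptions, the function names, and the data values are illustrative and are not specified in the text above.

```python
import numpy as np


def log_gaussian(x, mu, sigma2):
    """Log density of a univariate Gaussian N(x | mu, sigma2)."""
    return -0.5 * np.log(2 * np.pi * sigma2) - 0.5 * (x - mu) ** 2 / sigma2


def log_likelihood(data, mu, sigma2):
    """log p(D | mu) = sum_n log p(x_n | mu), valid because the x_n are independent given mu."""
    return float(np.sum(log_gaussian(np.asarray(data, dtype=float), mu, sigma2)))


def posterior_over_mean(data, sigma2, mu0, sigma0_2):
    """Mean and variance of p(mu | D) for a N(mu0, sigma0_2) prior and known noise variance."""
    data = np.asarray(data, dtype=float)
    precision_n = 1.0 / sigma0_2 + data.size / sigma2
    mu_n = (mu0 / sigma0_2 + data.sum() / sigma2) / precision_n
    return mu_n, 1.0 / precision_n


data = [1.2, 0.8, 1.1, 0.9]  # hypothetical observations
mu_n, sigma_n2 = posterior_over_mean(data, sigma2=1.0, mu0=0.0, sigma0_2=1.0)
print(f"log p(D | mu=1): {log_likelihood(data, 1.0, 1.0):.4f}")
print(f"posterior mean: {mu_n:.3f}, posterior variance: {sigma_n2:.3f}")
```

The posterior precision is the prior precision plus one unit of data precision per observation, which is only this simple because each $x_n$ contributes an independent factor.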

### Example: Naive Bayes Classifier

A related graphical structure arises in an approach to classification called the **naive Bayes** model, in which we use conditional independence assumptions to simplify the model structure. Suppose our observed variable consists of a $ D $-dimensional vector $ x = (x_1, \ldots, x_D)^T $, and we wish to assign observed values of $ x $ to one of $ K $ classes. Using the 1-of-$ K $ encoding scheme, we can represent these classes by a $ K $-dimensional binary vector $ z $. We can then define a generative model by introducing a multinomial prior $ p(z | \mu) $ over the class labels, where the $ k $-th component $ \mu_k $ of $ \mu $ is the prior probability of class $ C_k $, together with a conditional distribution $ p(x | z) $ for the observed vector $ x $.

The key assumption of the naive Bayes model is that, conditioned on the class $ z $, the distributions of the input variables $ x_1, \ldots, x_D $ are independent. The graphical representation of this model is shown in Fig.24. We see that observation of $ z $ blocks the path between $ x_i $ and $ x_j $ for $ j \neq i $ (because such paths are tail-to-tail at the node $ z $) and so $ x_i $ and $ x_j $ are conditionally independent given $ z $.

If, however, we marginalize out $ z $ (so that $ z $ is unobserved), the tail-to-tail path from $ x_i $ to $ x_j $ is no longer blocked. This tells us that in general the marginal density $ p(x) $ will not factorize with respect to the components of $ x $.

We encountered a simple application of the naive Bayes model in the context of fusing data from different sources for medical diagnosis in Section 1.5. If we are given a labelled training set, comprising inputs $ \{x_1, \ldots, x_N\} $ together with their class labels, then we can fit the naive Bayes model to the training data.


In [None]:
import networkx as nx
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# Step 1: Define the directed acyclic graph (DAG)
# Example: A -> B, C -> B, A -> C, and C -> D
G = nx.DiGraph()

# Adding edges to the graph
G.add_edges_from([('A', 'B'), ('C', 'B'), ('A', 'C'), ('C', 'D')])

# Step 2: Function to check if nodes are d-separated
def check_d_separation(graph, node_a, node_b, conditioning_set):
    """
    This function checks if nodes `node_a` and `node_b` are d-separated given `conditioning_set`
    in a directed acyclic graph (DAG).
    """
    # networkx tests d-separation between *sets* of nodes, so wrap the single nodes.
    # Newer networkx releases name the test is_d_separator; older ones use d_separated.
    d_sep = getattr(nx, "is_d_separator", None) or nx.d_separated
    return d_sep(graph, {node_a}, {node_b}, set(conditioning_set))

# Step 3: Check d-separation for some nodes in the graph
conditioning_set = {'C'}
is_d_separated = check_d_separation(G, 'A', 'B', conditioning_set)
print(f"Are A and B d-separated given C? {is_d_separated}")

# Step 4: Create a Bayesian Network
# Define a simple Bayesian Network: A -> B, A -> C, C -> D
model = BayesianNetwork([('A', 'B'), ('A', 'C'), ('C', 'D')])

# Define conditional probability distributions (CPDs); each column sums to 1
cpd_A = TabularCPD(variable='A', variable_card=2, values=[[0.6], [0.4]])  # P(A)
cpd_B = TabularCPD(variable='B', variable_card=2, values=[[0.7, 0.2], [0.3, 0.8]], evidence=['A'], evidence_card=[2])  # P(B|A)
# C has parent A in this model, so its CPD must be conditioned on A. Both columns
# reuse the original 0.5/0.5 values, so C does not actually vary with A here.
cpd_C = TabularCPD(variable='C', variable_card=2, values=[[0.5, 0.5], [0.5, 0.5]], evidence=['A'], evidence_card=[2])  # P(C|A)
cpd_D = TabularCPD(variable='D', variable_card=2, values=[[0.8, 0.3], [0.2, 0.7]], evidence=['C'], evidence_card=[2])  # P(D|C)

# Add CPDs to the model
model.add_cpds(cpd_A, cpd_B, cpd_C, cpd_D)

# Step 5: Perform inference using Variable Elimination
inference = VariableElimination(model)

# Perform a query: P(B | A=1)
prob_B_given_A = inference.query(variables=['B'], evidence={'A': 1})
print(prob_B_given_A)

# Step 6: Another example of conditional independence and d-separation
# Create a new graph example for testing
G2 = nx.DiGraph()
G2.add_edges_from([('A', 'B'), ('A', 'C'), ('C', 'B')])

# Check if A and B are d-separated given C in this new graph
conditioning_set2 = {'C'}
is_d_separated2 = check_d_separation(G2, 'A', 'B', conditioning_set2)
print(f"Are A and B d-separated given C in the second graph? {is_d_separated2}")


import networkx as nx

def check_d_separation(graph, node_a, node_b, conditioning_set):
    """
    Check if node_a and node_b are d-separated given conditioning set.
    
    Parameters:
    - graph: A directed acyclic graph (networkx.DiGraph).
    - node_a: The first node to check d-separation for.
    - node_b: The second node to check d-separation for.
    - conditioning_set: A set or list of nodes (or a single node name) to condition on.
    
    Returns:
    - True if node_a and node_b are d-separated given the conditioning set, else False.
    """
    # A single node name passed as a string is wrapped in a set
    if isinstance(conditioning_set, str):
        conditioning_set = {conditioning_set}
    
    # networkx tests d-separation between *sets* of nodes, so wrap the single nodes.
    # Newer networkx releases name the test is_d_separator; older ones use d_separated.
    d_sep = getattr(nx, "is_d_separator", None) or nx.d_separated
    return d_sep(graph, {node_a}, {node_b}, set(conditioning_set))

# Example Usage:
# Create a simple directed graph
graph = nx.DiGraph()

# Add edges (A -> B, A -> C, B -> D, etc.)
graph.add_edges_from([('A', 'B'), ('A', 'C'), ('B', 'D'), ('C', 'D')])

# Check if nodes A and D are d-separated given node B
result = check_d_separation(graph, 'A', 'D', {'B'})
print("Are A and D d-separated given B?", result)


![image-2.png](attachment:image-2.png)

Fig.25 We can view a graphical model (in this case a directed graph) as a filter in which a probability distribution $p(x)$ is allowed through the filter if, and only if, it satisfies the directed factorization property (8.5). The set of all possible probability distributions $p(x)$ that pass through the filter is denoted $\mathcal{DF}$. We can alternatively use the graph to filter distributions according to whether they respect all of the conditional independencies implied by the d-separation properties of the graph. The d-separation theorem says that it is the same set of distributions $\mathcal{DF}$ that will be allowed through this second kind of filter.

![image.png](attachment:image.png)

Fig.26 The Markov blanket of a node $x_i$ comprises the set of parents, children and co-parents of the node. It has the property that the conditional distribution of $x_i$, conditioned on all the remaining variables in the graph, is dependent only on the variables in the Markov blanket.

## Maximum Likelihood and Naive Bayes

## Naive Bayes with Gaussian Densities

Using **maximum likelihood** estimation, we assume the data are drawn independently from the model. The solution is obtained by fitting the model for each class separately using the correspondingly labeled data.

### Gaussian Example
Suppose that the probability density within each class is chosen to be Gaussian. Under the **Naive Bayes assumption**, the covariance matrix for each Gaussian is diagonal. This implies that the contours of constant density within each class are axis-aligned ellipsoids. 

However, the marginal density is given by a superposition of these diagonal Gaussians, weighted by the class priors, and therefore does not factorize with respect to its components.

$$
p(x) = \sum_{k=1}^K p(C_k)p(x|C_k)
$$
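
One quick way to see that the marginal does not factorize is to compute its covariance: each class has a diagonal covariance, yet the mixture generally acquires off-diagonal terms from the spread of the class means. The sketch below uses the law of total covariance with two hypothetical classes; all numerical values are illustrative assumptions.

```python
import numpy as np

# Hypothetical two-class model in two dimensions
priors = np.array([0.5, 0.5])                    # p(C_k)
means = np.array([[0.0, 0.0], [3.0, 3.0]])       # class means
variances = np.array([[1.0, 1.0], [1.0, 1.0]])   # diagonal of each class covariance

mixture_mean = priors @ means

# Law of total covariance: Cov[x] = E_k[Cov(x | C_k)] + Cov_k(E[x | C_k])
within = np.diag(priors @ variances)
centred = means - mixture_mean
between = (priors[:, None, None] * np.einsum("ki,kj->kij", centred, centred)).sum(axis=0)
mixture_cov = within + between

print(mixture_cov)
```

The off-diagonal entries of `mixture_cov` are non-zero here, so the two components of $x$ are correlated under $p(x)$ even though they are independent within each class.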

This approach is helpful when the dimensionality $D$ of the input space is high, making density estimation in the full $D$-dimensional space more challenging. It is also useful if the input vector contains both discrete and continuous variables. For example:
- **Discrete variables**: Modeled using Bernoulli distributions.
- **Continuous variables**: Modeled using Gaussians.
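
Mixing variable types is straightforward because each input contributes its own factor to the class-conditional density. The sketch below assumes one binary and one continuous input and a hypothetical two-class model; the class names, parameter values, and function names are invented for illustration.

```python
import math

# Hypothetical model: a binary input (Bernoulli) and a continuous input (Gaussian)
model = {
    "healthy": {"prior": 0.7, "bernoulli_p": 0.1, "mean": 36.8, "var": 0.16},
    "ill":     {"prior": 0.3, "bernoulli_p": 0.6, "mean": 38.5, "var": 0.49},
}


def log_joint(params, x_binary, x_real):
    """log p(C_k) + log p(x_binary | C_k) + log p(x_real | C_k)."""
    p = params["bernoulli_p"]
    log_bernoulli = math.log(p if x_binary == 1 else 1 - p)
    log_gaussian = (-0.5 * math.log(2 * math.pi * params["var"])
                    - 0.5 * (x_real - params["mean"]) ** 2 / params["var"])
    return math.log(params["prior"]) + log_bernoulli + log_gaussian


def posterior(x_binary, x_real):
    """Class posterior p(C_k | x), normalised in a numerically stable way."""
    scores = {k: log_joint(v, x_binary, x_real) for k, v in model.items()}
    top = max(scores.values())
    unnormalised = {k: math.exp(s - top) for k, s in scores.items()}
    total = sum(unnormalised.values())
    return {k: u / total for k, u in unnormalised.items()}


post = posterior(x_binary=1, x_real=38.4)
print(post)
```

Adding another input of either type only adds one more term to `log_joint`, which is the practical benefit of the factorized class-conditional density.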

### Strong Independence Assumptions
The conditional independence assumption in Naive Bayes is strong and may lead to poor representations of the class-conditional densities. Even so, the model may still classify well when the assumption is not exactly satisfied, because the **decision boundaries** can be insensitive to some of the details of the class-conditional densities.

---

## Conditional Independence and d-Separation

We represent a joint probability distribution $p(x)$ using directed graphs, where:
1. **Factorization**: The graph decomposes $p(x)$ into a product of conditional probabilities.
2. **d-Separation**: The graph expresses conditional independence properties via the **d-separation criterion**.

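The **d-separation theorem** states that these two characterizations are equivalent: the distributions that factorize according to the graph are exactly those that satisfy every conditional independence statement obtained from the graph by d-separation.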

### Graphical Models as Filters
A directed graph can be viewed as a filter:
- It allows distributions $p(x)$ to pass if and only if they satisfy the factorization implied by the graph.

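This set of distributions is denoted $\mathcal{DF}$, for *directed factorization*. Filtering instead on the conditional independencies implied by d-separation lets through exactly the same set.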
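
The filter view can be made concrete for small discrete distributions: compute each conditional $p(x_k | \text{parents of } x_k)$ from the joint itself and test whether their product reproduces the joint. The sketch below assumes a strictly positive joint table with one array axis per variable; the function name `passes_directed_filter` and the example tables are illustrative assumptions.

```python
import numpy as np


def passes_directed_filter(joint, parents):
    """
    joint:   array with one axis per variable (axis k holds variable k).
    parents: dict mapping a variable index to a tuple of its parent indices.
    Returns True if the joint equals the product of p(x_k | parents of x_k).
    """
    n = joint.ndim

    def marginal(keep):
        drop = tuple(i for i in range(n) if i not in keep)
        return joint.sum(axis=drop, keepdims=True)

    product = np.ones_like(joint)
    for k in range(n):
        pa = set(parents.get(k, ()))
        product = product * marginal(pa | {k}) / marginal(pa)  # p(x_k | pa_k)
    return bool(np.allclose(product, joint))


# A joint generated by the chain a -> c -> b, stored with axes (a, c, b)
p_a = np.array([0.6, 0.4])
p_c_given_a = np.array([[0.9, 0.2], [0.1, 0.8]])   # indexed [c, a]
p_b_given_c = np.array([[0.7, 0.3], [0.3, 0.7]])   # indexed [b, c]
joint = np.einsum("a,ca,bc->acb", p_a, p_c_given_a, p_b_given_c)

chain = {1: (0,), 2: (1,)}     # a -> c -> b
no_edges = {}                  # fully disconnected graph
collider = {1: (0, 2)}         # a -> c <- b

for name, graph in [("chain", chain), ("no edges", no_edges), ("collider", collider)]:
    print(name, passes_directed_filter(joint, graph))
```

The chain-generated distribution passes the filter of its own graph but is rejected by the disconnected graph and by the head-to-head graph, whose implied independence $a \perp\!\!\!\perp b \, | \, \emptyset$ it does not satisfy.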

---

## Markov Blanket

The **Markov blanket** of a node $x_i$ is the minimal set of nodes that isolates $x_i$ from the rest of the graph. It consists of:
1. **Parents** of $x_i$,
2. **Children** of $x_i$,
3. **Co-parents** (other parents of the children of $x_i$).

### Conditional Distribution
Using the factorization property:

$$
p(x_i | x_{j \neq i}) = \frac{p(x_1, \ldots, x_D)}{\int p(x_1, \ldots, x_D) \, dx_i}
$$

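Every factor that does not depend on $x_i$ can be taken outside the integral and cancels between numerator and denominator, so only the following terms affect $p(x_i | x_{j \neq i})$:
- $p(x_i | \text{parents of } x_i)$,
- $p(x_k | \text{parents of } x_k)$, where $x_k$ is a child of $x_i$; these factors also depend on the co-parents of $x_i$.

Thus $x_i$ is independent of all other variables in the graph when conditioned on its Markov blanket.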
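
This cancellation can be verified on a small example. The sketch below assumes a hypothetical six-node binary network, $u \to p \to x \to c \leftarrow q$ with $c \to d$, so that the Markov blanket of $x$ is $\{p, c, q\}$; all probability tables are illustrative values.

```python
import numpy as np

# Hypothetical network: u -> p -> x -> c <- q, c -> d (all binary)
p_u = np.array([0.6, 0.4])
p_p_given_u = np.array([[0.7, 0.2], [0.3, 0.8]])             # indexed [p, u]
p_x_given_p = np.array([[0.9, 0.3], [0.1, 0.7]])             # indexed [x, p]
p_q = np.array([0.5, 0.5])
p_c_given_xq = np.array([[[0.9, 0.5], [0.4, 0.05]],
                         [[0.1, 0.5], [0.6, 0.95]]])         # indexed [c, x, q]
p_d_given_c = np.array([[0.8, 0.25], [0.2, 0.75]])           # indexed [d, c]

# Joint distribution with axes ordered (u, p, x, q, c, d)
joint = np.einsum("u,pu,xp,q,cxq,dc->upxqcd",
                  p_u, p_p_given_u, p_x_given_p, p_q, p_c_given_xq, p_d_given_c)

# p(x | all other variables): normalise over the x axis (axis 2)
cond = joint / joint.sum(axis=2, keepdims=True)

# Outside the blanket: changing u or d must not change the conditional
invariant_to_u = np.allclose(cond, cond[:1])
invariant_to_d = np.allclose(cond, cond[..., :1])
# Inside the blanket: the co-parent q does matter
depends_on_q = not np.allclose(cond, cond[:, :, :, :1])

print("invariant to u:", invariant_to_u)
print("invariant to d:", invariant_to_d)
print("depends on co-parent q:", depends_on_q)
```

The grandparent `u` and the grandchild `d` drop out of the conditional, while the co-parent `q` does not.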

### Illustration of Markov Blanket

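Fig.26 illustrates the Markov blanket of a node $x_i$. Conditioning on the parents and children alone is not sufficient: observing a child node does not block the path to its co-parents, which remain coupled to $x_i$ through the explaining-away effect. The co-parents must therefore be observed as well.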

---

## Summary
- The **Naive Bayes assumption** simplifies modeling, especially for high-dimensional data, but relies on strong independence assumptions.
- Directed graphs represent **joint probability distributions** via **factorization** and **d-separation**.
- The **Markov blanket** of a node is the minimal set of nodes that, once observed, renders the node conditionally independent of the rest of the graph.





In [3]:
import numpy as np
import networkx as nx


# Naive Bayes Classifier with Gaussian Assumption
class NaiveBayesGaussian:
    def __init__(self):
        self.class_priors = {}
        self.class_means = {}
        self.class_variances = {}
    
    def fit(self, X, y):
        """
        Fit the Naive Bayes model assuming Gaussian densities.
        
        Parameters:
        - X: numpy array of shape (n_samples, n_features), feature matrix.
        - y: numpy array of shape (n_samples,), class labels.
        """
        classes = np.unique(y)
        for cls in classes:
            X_cls = X[y == cls]
            self.class_priors[cls] = len(X_cls) / len(y)
            self.class_means[cls] = np.mean(X_cls, axis=0)
            self.class_variances[cls] = np.var(X_cls, axis=0)
    
    def predict(self, X):
        """
        Predict class labels for input data.
        
        Parameters:
        - X: numpy array of shape (n_samples, n_features).
        
        Returns:
        - numpy array of predicted class labels.
        """
        predictions = []
        for x in X:
            posteriors = {}
            for cls in self.class_priors:
                mean = self.class_means[cls]
                var = self.class_variances[cls]
                prior = np.log(self.class_priors[cls])
                likelihood = -0.5 * np.sum(np.log(2 * np.pi * var)) - 0.5 * np.sum(((x - mean) ** 2) / var)
                posteriors[cls] = prior + likelihood
            predictions.append(max(posteriors, key=posteriors.get))
        return np.array(predictions)


# Graph-Based Functions: d-Separation and Markov Blanket
def is_d_separated(graph, node_a, node_b, conditioning_set):
    """
    Check if two nodes are d-separated given a conditioning set.
    
    Parameters:
    - graph: networkx.DiGraph, the directed acyclic graph.
    - node_a: str, the first node.
    - node_b: str, the second node.
    - conditioning_set: set of nodes.
    
    Returns:
    - bool: True if node_a and node_b are d-separated, False otherwise.
    """
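    # Newer networkx releases name the test is_d_separator; older ones use d_separated
    d_sep = getattr(nx, "is_d_separator", None) or nx.d_separated
    return d_sep(graph, {node_a}, {node_b}, set(conditioning_set))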


def find_markov_blanket(graph, node):
    """
    Find the Markov blanket of a given node.
    
    Parameters:
    - graph: networkx.DiGraph, the directed acyclic graph.
    - node: str, the node for which to find the Markov blanket.
    
    Returns:
    - set: Markov blanket of the node.
    """
    parents = set(graph.predecessors(node))
    children = set(graph.successors(node))
    co_parents = set()
    for child in children:
        co_parents.update(set(graph.predecessors(child)))
    co_parents.discard(node)
    return parents.union(children).union(co_parents)


# Example Usage
if __name__ == "__main__":
    # Naive Bayes Example
    X = np.array([[1.5, 2.0], [1.2, 1.8], [3.5, 4.0], [3.8, 3.5]])
    y = np.array([0, 0, 1, 1])
    nb = NaiveBayesGaussian()
    nb.fit(X, y)
    predictions = nb.predict(X)
    print("Predictions:", predictions)

    # Directed Graph Example
    G = nx.DiGraph()
    G.add_edges_from([("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("E", "C")])
    print("Is d-separated (A, E | {B})?", is_d_separated(G, "A", "E", {"B"}))
    print("Markov blanket of node 'C':", find_markov_blanket(G, "C"))


Predictions: [0 0 1 1]
Is d-separated (A, E | {B})? True
Markov blanket of node 'C': {'B', 'A', 'E', 'D'}
