In [None]:
'''
 * Copyright (c) 2008 Radhamadhab Dalai
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 * THE SOFTWARE.
'''

## Probabilistic Graphical Models

Probabilities play a central role in modern pattern recognition. We have seen in Chapter 1 that probability theory can be expressed in terms of two simple equations corresponding to the sum rule and the product rule. All of the probabilistic inference and learning manipulations discussed in this book, no matter how complex, amount to repeated application of these two equations. We could therefore proceed to formulate and solve complicated probabilistic models purely by algebraic manipulation. However, we shall find it highly advantageous to augment the analysis using diagrammatic representations of probability distributions, called **probabilistic graphical models**. These offer several useful properties:

1. They provide a simple way to visualize the structure of a probabilistic model and can be used to design and motivate new models.
2. Insights into the properties of the model, including conditional independence properties, can be obtained by inspection of the graph.
3. Complex computations, required to perform inference and learning in sophisticated models, can be expressed in terms of graphical manipulations, in which underlying mathematical expressions are carried along implicitly.

A graph comprises **nodes** (also called **vertices**) connected by **links** (also known as **edges** or **arcs**). In a probabilistic graphical model, each node represents a random variable (or group of random variables), and the links express probabilistic relationships between these variables. The graph then captures the way in which the joint distribution over all of the random variables can be decomposed into a product of factors each depending only on a subset of the variables.

We shall begin by discussing **Bayesian networks**, also known as directed graphical models, in which the links of the graphs have a particular directionality indicated by arrows. The other major class of graphical models are **Markov random fields**, also known as **undirected graphical models**, in which the links do not carry arrows and have no directional significance. Directed graphs are useful for expressing causal relationships between random variables, whereas undirected graphs are better suited to expressing soft constraints between random variables.

For the purposes of solving inference problems, it is often convenient to convert both directed and undirected graphs into a different representation called a **factor graph**. In this chapter, we shall focus on the key aspects of graphical models as needed for applications in pattern recognition and machine learning. More general treatments of graphical models can be found in the books by Whittaker (1990), Lauritzen (1996), Jensen (1996), Castillo et al. (1997), Jordan (1999), Cowell et al. (1999), and Jordan (2007).

### Bayesian Networks

In order to motivate the use of directed graphs to describe probability distributions, consider first an arbitrary joint distribution $ p(a, b, c) $ over three variables $ a $, $ b $, and $ c $. Note that at this stage, we do not need to specify anything further about these variables, such as whether they are discrete or continuous. Indeed, one of the powerful aspects of graphical models is that a specific graph can make probabilistic statements for a broad class of distributions.

By application of the product rule of probability, we can write the joint distribution in the form:

$$ p(a, b, c) = p(c | a, b) p(a, b) $$

A second application of the product rule, this time to the second term on the right-hand side of the equation, gives:

$$ p(a, b, c) = p(c | a, b) p(b | a) p(a) $$

Note that this decomposition holds for any choice of the joint distribution. We now represent the right-hand side of this equation in terms of a simple graphical model as follows.
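The claim that this decomposition holds for any joint distribution can be checked numerically. The following is a minimal sketch, assuming three binary variables and a randomly generated joint table (both choices are illustrative and not part of the original text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary joint table p(a, b, c) over three binary variables (axes: a, b, c).
joint = rng.random((2, 2, 2))
joint /= joint.sum()

# Marginals obtained with the sum rule.
p_a = joint.sum(axis=(1, 2))   # p(a)
p_ab = joint.sum(axis=2)       # p(a, b)

# Conditionals obtained with the product rule.
p_b_given_a = p_ab / p_a[:, None]         # p(b | a)
p_c_given_ab = joint / p_ab[:, :, None]   # p(c | a, b)

# Recombine the three factors and compare with the original table.
rebuilt = p_c_given_ab * p_b_given_a[:, :, None] * p_a[:, None, None]
print(np.allclose(rebuilt, joint))
```

Any other normalized table would pass the same check, since nothing in the construction depends on the particular values.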

#### Example of a Directed Graph Representation

First, we introduce a node for each of the random variables $ a $, $ b $, and $ c $ and associate each node with the corresponding conditional distribution on the right-hand side of the equation. For the factor $ p(c | a, b) $, there will be links from nodes $ a $ and $ b $ to node $ c $, whereas for the factor $ p(a) $, there will be no incoming links. The result is the graph shown below:

$$
\begin{array}{c}
a \rightarrow b, \qquad a \rightarrow c, \qquad b \rightarrow c \\
\end{array}
$$

If there is a link going from a node $ a $ to a node $ b $, then we say that node $ a $ is the parent of node $ b $, and we say that node $ b $ is the child of node $ a $. Note that we shall not make any formal distinction between a node and the variable to which it corresponds but will simply use the same symbol to refer to both.

#### Joint Distribution of Multiple Variables

Consider the graph for a joint distribution over $ K $ variables given by:

$$ p(x_1, \dots, x_K) = p(x_K | x_1, \dots, x_{K-1}) \dots p(x_2 | x_1) p(x_1) $$

For a given choice of $ K $, we can represent this as a directed graph having $ K $ nodes, one for each conditional distribution on the right-hand side of the equation, with each node having incoming links from all lower-numbered nodes. We say that this graph is **fully connected** because there is a link between every pair of nodes.
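As a small illustration of this construction, the sketch below builds the parent sets of a fully connected directed graph and counts its links. The helper name `fully_connected_parents` is hypothetical and introduced only for this example:

```python
def fully_connected_parents(K):
    """Return a dict mapping node k to its parents, i.e. all lower-numbered nodes."""
    return {k: list(range(1, k)) for k in range(1, K + 1)}

K = 4
parents = fully_connected_parents(K)
num_links = sum(len(pa) for pa in parents.values())

print(parents)
print(num_links, K * (K - 1) // 2)  # one link per pair of nodes
```

The link count equals $ K(K-1)/2 $, which is the number of distinct pairs of nodes.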

### Absence of Links and Independence

So far, we have worked with completely general joint distributions, so that the decompositions, and their representations as fully connected graphs, will be applicable to any choice of distribution. As we shall see shortly, it is the **absence of links** in the graph that conveys interesting information about the properties of the class of distributions that the graph represents.

Consider the graph shown below. This is not a fully connected graph because, for instance, there is no link from $ x_1 $ to $ x_2 $ or from $ x_3 $ to $ x_7 $.

### Example Graph

$$
\begin{array}{c}
x_1 \rightarrow x_4, \quad x_2 \rightarrow x_4, \quad x_3 \rightarrow x_4 \\
x_1 \rightarrow x_5, \quad x_3 \rightarrow x_5 \\
x_4 \rightarrow x_6, \quad x_4 \rightarrow x_7, \quad x_5 \rightarrow x_7
\end{array}
$$

We shall now go from this graph to the corresponding representation of the joint probability distribution written in terms of the product of a set of conditional distributions, one for each node in the graph. Each such conditional distribution will be conditioned only on the parents of the corresponding node in the graph. For instance, $ x_5 $ will be conditioned on $ x_1 $ and $ x_3 $.
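Reading a factorization off a graph is mechanical once the parent sets are known. The sketch below assumes the parent sets of the seven-node directed acyclic graph in Fig.2; only the parents of $ x_5 $ and the two missing links are stated explicitly in the text, so the remaining entries are an assumption transcribed from the figure:

```python
# Parent sets assumed from the seven-node example graph (see the note above).
parents = {
    "x1": [],
    "x2": [],
    "x3": [],
    "x4": ["x1", "x2", "x3"],
    "x5": ["x1", "x3"],
    "x6": ["x4"],
    "x7": ["x4", "x5"],
}

def factorization(parents):
    """Build the product of conditionals, one factor per node."""
    factors = []
    for node, pa in parents.items():
        factors.append(f"p({node} | {', '.join(pa)})" if pa else f"p({node})")
    return " ".join(factors)

print(factorization(parents))
```

Nodes without parents contribute marginal factors, and every other node contributes a factor conditioned only on its parents.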

![image.png](attachment:image.png)

Fig.1 A directed graphical model representing the joint probability distribution over three variables $a$, $b$, and $c$, corresponding to the decomposition on the right-hand side of (8.2).



In [1]:
import random

# Define the conditional probability tables (CPTs)
# Let's assume a simple case with 3 variables: A, B, C
# Example of conditional probabilities (in real-world applications, these would come from data)

# P(A)
def p_A():
    return 0.5  # Assume A has a 50% chance of being True

# P(B | A)
def p_B_given_A(a):
    if a:
        return 0.8  # If A is True, B has an 80% chance of being True
    else:
        return 0.4  # If A is False, B has a 40% chance of being True

# P(C | A, B)
def p_C_given_A_B(a, b):
    if a and b:
        return 0.9  # If both A and B are True, C has a 90% chance of being True
    elif a and not b:
        return 0.7  # If A is True and B is False, C has a 70% chance of being True
    elif not a and b:
        return 0.6  # If A is False and B is True, C has a 60% chance of being True
    else:
        return 0.3  # If both A and B are False, C has a 30% chance of being True

# Define a function to compute the joint probability p(a, b, c) = p(a) p(b | a) p(c | a, b)
def joint_probability(a, b, c):
    # Each CPT returns the probability of True, so use the complement for a False value
    pa = p_A() if a else 1 - p_A()
    pb = p_B_given_A(a) if b else 1 - p_B_given_A(a)
    pc = p_C_given_A_B(a, b) if c else 1 - p_C_given_A_B(a, b)
    return pa * pb * pc

# Sample from the network
def sample():
    a = random.random() < p_A()  # Sample A based on its marginal probability
    b = random.random() < p_B_given_A(a)  # Sample B based on the value of A
    c = random.random() < p_C_given_A_B(a, b)  # Sample C based on A and B
    return a, b, c

# Generate a sample
sampled_values = sample()
print("Sampled Values: A={}, B={}, C={}".format(sampled_values[0], sampled_values[1], sampled_values[2]))

# Calculate the joint probability of a specific configuration of values
a_val = True
b_val = False
c_val = True
prob = joint_probability(a_val, b_val, c_val)
print("Joint probability for A={}, B={}, C={} is: {:.4f}".format(a_val, b_val, c_val, prob))


Sampled Values: A=True, B=False, C=True
Joint probability for A=True, B=False, C=True is: 0.0700


![image-6.png](attachment:image-6.png)

Fig.2 Example of a directed acyclic graph describing the joint distribution over variables $x_1, \dots, x_7$. The corresponding decomposition of the joint distribution is given by (8.4).



![image-5.png](attachment:image-5.png)

Fig.3 Directed graphical model representing the joint distribution (8.6) corresponding to the Bayesian polynomial regression model introduced in Section 1.2.6.

![image-4.png](attachment:image-4.png)

Fig.4 An alternative, more compact, representation of the graph shown in Fig.3 in which we have introduced a plate (the box labelled $N$) that represents $N$ nodes of which only a single example $t_n$ is shown explicitly.

![image-3.png](attachment:image-3.png)

Fig.5 This shows the same model as in Fig.4 but with the deterministic parameters shown explicitly by the smaller solid nodes.

![image-2.png](attachment:image-2.png)

Fig.6 As in Fig.5 but with the nodes $\{t_n\}$ shaded to indicate that the corresponding random variables have been set to their observed (training set) values.


![image.png](attachment:image.png)

Fig.7 The polynomial regression model, corresponding to Fig.6, showing also a new input value $\tilde{x}$ together with the corresponding model prediction $\tilde{t}$.

## Bayesian Polynomial Regression - Directed Graphical Model

### Step 1: Joint Distribution
The joint distribution over the observed data $ t = (t_1, t_2, ..., t_N)^T $ and the model parameters $ w $ is:

$$
p(t, w) = p(w) \prod_{n=1}^{N} p(t_n | w)
$$

Where:
- $ p(w) $ is the prior distribution of the polynomial coefficients.
- $ p(t_n | w) $ is the likelihood of the observation $ t_n $ given the polynomial coefficients $ w $.

### Step 2: Compact Representation Using Plates
Making the inputs and the deterministic parameters explicit, the model drawn with a plate over the $ N $ observations corresponds to:

$$
p(t, w | x, \alpha, \sigma^2) = p(w | \alpha) \prod_{n=1}^{N} p(t_n | w, x_n, \sigma^2)
$$

Where:
- $ x = (x_1, x_2, ..., x_N)^T $ are the input data points.
- $ \alpha $ is the precision of the Gaussian prior over $ w $.
- $ \sigma^2 $ is the noise variance.

### Step 3: Including Observed Data
Once the values $ t $ have been observed, the posterior distribution over $ w $ is proportional to the joint distribution:

$$
p(w | t) \propto p(w) \prod_{n=1}^{N} p(t_n | w)
$$

### Step 4: Predicting for a New Input $ \tilde{x} $
For predicting a new output $ \tilde{t} $ for a new input $ \tilde{x} $, the joint distribution is:

$$
p(\tilde{t}, t, w | \tilde{x}, x, \alpha, \sigma^2) = \left[ \prod_{n=1}^{N} p(t_n | x_n, w, \sigma^2) \right] p(w | \alpha) \, p(\tilde{t} | \tilde{x}, w, \sigma^2)
$$

### Step 5: Posterior Distribution for $ w $
The posterior distribution of $ w $ given the observed data $ t $ is derived from Bayes' theorem:

$$
p(w | t) \propto p(w) \prod_{n=1}^{N} p(t_n | w)
$$

### Step 6: Predictive Distribution for $ \tilde{t} $
Finally, the predictive distribution for $ \tilde{t} $ is:

$$
p(\tilde{t} | \tilde{x}, x, t, \alpha, \sigma^2) \propto \int p(\tilde{t}, t, w | \tilde{x}, x, \alpha, \sigma^2) \, dw
$$
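When both the prior $ p(w | \alpha) $ and the noise model are Gaussian, the integral above can be evaluated in closed form. The sketch below uses the standard conjugate results for Bayesian linear regression, assuming a zero-mean isotropic prior with precision $ \alpha $, a known noise variance $ \sigma^2 $, and polynomial basis functions. The synthetic data, the degree, and the helper name `design_matrix` are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def design_matrix(x, degree):
    """Polynomial features [1, x, x^2, ...] for a 1-D input array."""
    return np.vander(x, degree + 1, increasing=True)

# Illustrative settings and synthetic training data (assumed, not from the text).
alpha, sigma2, degree = 1.0, 0.25, 2
x = rng.uniform(-1, 1, size=20)
t = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(0, np.sqrt(sigma2), size=x.shape)

Phi = design_matrix(x, degree)
beta = 1.0 / sigma2  # noise precision

# Posterior p(w | t) = N(w | m_N, S_N).
S_N = np.linalg.inv(alpha * np.eye(degree + 1) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

# Predictive distribution for a new input: Gaussian with the mean and variance below.
x_new = np.array([0.5])
phi_new = design_matrix(x_new, degree)[0]
pred_mean = phi_new @ m_N
pred_var = sigma2 + phi_new @ S_N @ phi_new

print("posterior mean of w:", m_N)
print("predictive mean and variance:", pred_mean, pred_var)
```

The predictive variance is the sum of the noise variance and a term that reflects the remaining uncertainty in $ w $, so it is never smaller than $ \sigma^2 $.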


In [None]:
# Import required libraries
import numpy as np
import matplotlib.pyplot as plt
import pymc3 as pm
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Generate synthetic data
np.random.seed(42)

# True coefficients for the polynomial (quadratic)
true_coeffs = np.array([1.0, -2.0, 3.0])
n_samples = 100

# Generate random input data (x)
X = np.random.uniform(-3, 3, size=(n_samples, 1))

# Create polynomial features
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# Add noise to the output
sigma = 1.0
y_true = np.dot(X_poly, true_coeffs) + np.random.normal(0, sigma, size=n_samples)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_poly, y_true, test_size=0.2, random_state=42)

# Define the model
with pm.Model() as model:
    # Gaussian prior over the polynomial coefficients, p(w | alpha).
    # alpha is a precision, so the prior standard deviation is alpha ** -0.5.
    alpha = 1.0
    w = pm.Normal('w', mu=0, sd=alpha ** -0.5, shape=X_train.shape[1])
    
    # Likelihood (data model); the noise standard deviation is given its own prior
    sigma = pm.HalfNormal('sigma', sd=1)
    y_obs = pm.Normal('y_obs', mu=pm.math.dot(X_train, w), sd=sigma, observed=y_train)
    
    # Inference: MCMC sampling (pm.sample assigns NUTS to continuous variables by default)
    trace = pm.sample(2000, return_inferencedata=False, tune=1000)

# Plot the trace for the coefficients
pm.traceplot(trace, var_names=['w', 'sigma'])
plt.show()

# Posterior samples of the coefficients, shape (n_draws, n_coeffs)
w_samples = trace['w']

# Predicted mean function at each test point for each draw, shape (n_test, n_draws)
y_pred_samples = np.dot(X_test, w_samples.T)

# Mean and spread over the posterior draws (axis=1).
# The band reflects uncertainty in w only; it does not include observation noise.
y_pred_mean = np.mean(y_pred_samples, axis=1)
y_pred_std = np.std(y_pred_samples, axis=1)

# Sort by input value so the line and band are drawn from left to right
order = np.argsort(X_test[:, 1])
x_sorted = X_test[order, 1]

# Plot the results
plt.figure(figsize=(10, 6))
plt.plot(X_test[:, 1], y_test, 'b.', label='Test data')
plt.plot(x_sorted, y_pred_mean[order], 'r-', label='Posterior mean prediction')
plt.fill_between(x_sorted, (y_pred_mean - 1.96 * y_pred_std)[order], (y_pred_mean + 1.96 * y_pred_std)[order], color='r', alpha=0.2, label='95% credible band (mean function)')
plt.legend()
plt.xlabel('Input (x)')
plt.ylabel('Output (y)')
plt.title('Bayesian Polynomial Regression Predictions')
plt.show()

# Evaluate the model performance
mse = mean_squared_error(y_test, y_pred_mean)
r2 = r2_score(y_test, y_pred_mean)

print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')


## Generative Models and Discrete Variables

###  Generative Models

Generative models describe the process by which data is generated. Ancestral sampling is a method for generating samples from a probability distribution defined by a directed acyclic graph (DAG).

### Joint Distribution Factorization
The joint distribution over $K$ variables factorizes as:
$$
p(x_1, x_2, \ldots, x_K) = \prod_{k=1}^K p(x_k | \text{pa}_k),
$$
where $\text{pa}_k$ represents the parents of node $x_k$.

### Ancestral Sampling Algorithm
1. **Order nodes** such that each node $n$ has a higher index than its parents.
2. **Sample from the priors**:
   - Start from the lowest-numbered node.
   - Sample $x_1 \sim p(x_1)$.
3. **Conditioned sampling**:
   - For each node $n$, sample $x_n \sim p(x_n | \text{pa}_n)$.
4. **Generate marginal samples**:
   - To sample from a subset, only retain the relevant node values.
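Step 1 requires an ordering in which every node comes after its parents. One way to obtain such an ordering is a topological sort. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later), and the three-node graph is an assumed example:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Map each node to its parents (predecessors); keys are deliberately out of order.
parents = {"c": ["a", "b"], "b": ["a"], "a": []}

# static_order() yields nodes so that every parent appears before its children.
order = list(TopologicalSorter(parents).static_order())
print(order)  # ['a', 'b', 'c']
```

Sampling the nodes in this order guarantees that the parent values needed by each conditional distribution are already available.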

### Example: Object Recognition
The generative process for an object recognition task can be represented by the graphical model:
- **Latent variables**:
  - Object identity (discrete variable).
  - Position and orientation (continuous variables).
- **Observed variables**:
  - Image intensities (vector).

### Key Points
- Generative models mimic the data generation process and are often used to simulate "fantasy" data.
- To make a model generative, all input variables need probability distributions.

---

## Discrete Variables

Discrete probability distributions can be effectively represented in graphical models. These models are used to construct joint distributions from simpler building blocks.

### Probability of a Discrete Variable
For a single discrete variable $x$ with $K$ states, written as a 1-of-$K$ vector with components $x_k \in \{0, 1\}$ of which exactly one equals 1:
$$
p(x | \mu) = \prod_{k=1}^K \mu_k^{x_k},
$$
where:
- $\mu = (\mu_1, \ldots, \mu_K)$,
- $\sum_{k=1}^K \mu_k = 1$.
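With the 1-of-$K$ encoding, the product simply picks out the probability of the active state. A minimal sketch, assuming an illustrative parameter vector for $K = 3$:

```python
import numpy as np

mu = np.array([0.2, 0.5, 0.3])  # assumed state probabilities, summing to 1
x = np.array([0, 1, 0])         # 1-of-K vector: the second state is active

# prod_k mu_k ** x_k: every factor is 1 except the one for the active state.
p = np.prod(mu ** x)
print(p)  # 0.5
```

Because of the sum-to-one constraint, only $K - 1$ of the $K$ parameters are free.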

### Joint Distribution Over Two Variables
For two discrete variables $x_1$ and $x_2$, each with $K$ states:
$$
p(x_1, x_2 | \mu) = \prod_{k=1}^K \prod_{l=1}^K \mu_{kl}^{x_{1k} x_{2l}},
$$
where $\mu_{kl}$ is the probability of observing both $x_{1k} = 1$ and $x_{2l} = 1$, subject to the constraint $\sum_{k} \sum_{l} \mu_{kl} = 1$.

### Graphical Representation
- **Fully Connected Graph**:
  - All possible dependencies between variables.
  - $K^2 - 1$ parameters required.
- **Independent Graph**:
  - Variables are independent.
  - Total parameters = $2(K - 1)$.

### Example
Consider two discrete variables:
1. Fully connected graph (general joint distribution):
 
 ![image.png](attachment:image.png)
 

2. Independent variables:
   ![Independent Graph](example_graph_independent.png)

### Application to Large Models
For $M$ discrete variables, the parameter count grows as $K^M - 1$, but using graphical models reduces this complexity through conditional independence.
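The effect of removing links on the parameter count can be made concrete. In a directed graph over $K$-state variables with full conditional tables, each node needs $K - 1$ free parameters for every joint configuration of its parents. The sketch below applies this rule to three assumed structures (fully connected, chain, and fully independent). The function name `num_parameters` is hypothetical:

```python
def num_parameters(parents, K):
    """Free parameters of a discrete DAG: (K - 1) per parent configuration per node."""
    return sum((K - 1) * K ** len(pa) for pa in parents.values())

K, M = 3, 4
nodes = list(range(M))

full = {i: nodes[:i] for i in nodes}                  # all lower-numbered nodes
chain = {i: nodes[max(i - 1, 0):i] for i in nodes}    # only the previous node
independent = {i: [] for i in nodes}                  # no links at all

print(num_parameters(full, K), K ** M - 1)   # 80 80
print(num_parameters(chain, K))              # 20
print(num_parameters(independent, K))        # 8
```

The fully connected graph recovers the general count $K^M - 1$, while the chain grows only linearly with $M$.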


---

### Implementation in Python

Ancestral sampling and discrete probability distributions can be implemented as follows:



In [None]:

import numpy as np

def ancestral_sampling(dag, conditional_distributions):
    """
    Perform ancestral sampling on a directed acyclic graph (DAG).
    
    Args:
        dag (dict): Maps each node to a list of its parents. Keys must be listed so that parents come before their children.
        conditional_distributions (dict): Mapping of nodes to conditional sampling functions.
        
    Returns:
        dict: Sampled values for each node.
    """
    sampled_values = {}
    for node in dag:
        parents = dag[node]
        parent_values = {p: sampled_values[p] for p in parents}
        sampled_values[node] = conditional_distributions[node](**parent_values)
    return sampled_values

# Example: Sampling from p(x1, x2) where x2 depends on x1
def p_x1():
    return np.random.choice([0, 1], p=[0.6, 0.4])

def p_x2_given_x1(x1):
    return np.random.choice([0, 1], p=[0.7, 0.3] if x1 == 0 else [0.2, 0.8])

dag = {
    "x1": [],
    "x2": ["x1"]
}

conditional_distributions = {
    "x1": lambda: p_x1(),
    "x2": lambda x1: p_x2_given_x1(x1)
}

samples = ancestral_sampling(dag, conditional_distributions)
print("Sampled values:", samples)
