In [None]:
'''
 * Copyright (c) 2016 Radhamadhab Dalai
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 * THE SOFTWARE.
'''

# Discrete and Continuous Probabilities (Continued)

### Discrete Probabilities
![image.png](attachment:image.png)

**Fig.2: Discrete Bivariate Probability Mass Function**
Visualization of a discrete bivariate probability mass function, with random variables $X$ and $Y$. This diagram is adapted from Bishop (2006).

The probability that $X = x$ and $Y = y$ is (lazily) written as $p(x, y)$ and is called the **joint probability**. One can think of a probability as a function that takes states $x$ and $y$ and returns a real number, which is why we write $p(x, y)$.

The **marginal probability** that $X$ takes the value $x$ irrespective of the value of random variable $Y$ is (lazily) written as $p(x)$. We write $X \sim p(x)$ to denote that the random variable $X$ is distributed according to $p(x)$.

If we consider only the instances where $X = x$, then the fraction of instances (the **conditional probability**) for which $Y = y$ is written (lazily) as $p(y | x)$.

### Example 2

Consider two random variables $X$ and $Y$, where $X$ has five possible states and $Y$ has three possible states, as shown in Figure 6.2. We denote by $n_{ij}$ the number of events with state $X = x_i$ and $Y = y_j$, and denote by $N$ the total number of events. The value $c_i$ is the sum of the individual frequencies for the $i$-th column, that is, $c_i = \sum_{j=1}^3 n_{ij}$. Similarly, the value $r_j$ is the row sum, that is, $r_j = \sum_{i=1}^5 n_{ij}$.

Using these definitions, we can compactly express the distribution of $X$ and $Y$. The probability distribution of each random variable, the marginal probability, can be seen as the sum over a row or column:

$$P(X = x_i) = \frac{\sum_{j=1}^3 n_{ij}}{N} = \frac{c_i}{N} \quad \text{(6.10)}$$

and

$$P(Y = y_j) = \frac{\sum_{i=1}^5 n_{ij}}{N} = \frac{r_j}{N} \quad \text{(6.11)}$$

Here, $c_i$ and $r_j$ are the $i$-th column sum and the $j$-th row sum of the frequency table, respectively. By convention, for discrete random variables with a finite number of events, we assume that probabilities sum up to one, that is:

$$\sum_{i=1}^5 P(X = x_i) = 1 \quad \text{and} \quad \sum_{j=1}^3 P(Y = y_j) = 1 \quad \text{(6.12)}$$

The conditional probability is the fraction of a row or column total that falls in a particular cell. For example, the conditional probability of $Y$ given $X$ is:

$$P(Y = y_j | X = x_i) = \frac{n_{ij}}{c_i} \quad \text{(6.13)}$$

And the conditional probability of $X$ given $Y$ is:

$$P(X = x_i | Y = y_j) = \frac{n_{ij}}{r_j} \quad \text{(6.14)}$$
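
The definitions (6.10) to (6.14) can be sketched for a table of the same shape as the example (five states of $X$, three states of $Y$). The counts below are hypothetical and chosen only for illustration, and the variable names (`n`, `c`, `r`, `p_x`, and so on) are assumptions of this sketch. Exact fractions are used so that the normalization checks hold without rounding.

```python
from fractions import Fraction

# Hypothetical counts n[i][j]: outer index i runs over the 5 states of X,
# inner index j runs over the 3 states of Y. These numbers are made up.
n = [
    [4, 1, 0],
    [2, 3, 1],
    [0, 5, 2],
    [1, 2, 6],
    [3, 0, 0],
]
N = sum(sum(row) for row in n)  # total number of events

# c_i = sum_j n_ij (all events with X = x_i), r_j = sum_i n_ij (all events with Y = y_j)
c = [sum(row) for row in n]
r = [sum(col) for col in zip(*n)]

# Marginal probabilities, (6.10) and (6.11)
p_x = [Fraction(c_i, N) for c_i in c]
p_y = [Fraction(r_j, N) for r_j in r]

# Conditional probabilities, (6.13) and (6.14); both tables are indexed [i][j]
p_y_given_x = [[Fraction(n[i][j], c[i]) for j in range(3)] for i in range(5)]
p_x_given_y = [[Fraction(n[i][j], r[j]) for j in range(3)] for i in range(5)]

# Normalization (6.12), plus the same check for every conditional distribution
assert sum(p_x) == 1 and sum(p_y) == 1
assert all(sum(row) == 1 for row in p_y_given_x)
assert all(sum(p_x_given_y[i][j] for i in range(5)) == 1 for j in range(3))

# Consistency check: (n_ij / c_i) * (c_i / N) recovers the joint n_ij / N
assert all(
    p_y_given_x[i][j] * p_x[i] == Fraction(n[i][j], N)
    for i in range(5)
    for j in range(3)
)

print("P(X = x_i):", [str(p) for p in p_x])
print("P(Y = y_j):", [str(p) for p in p_y])
```

The last assertion follows directly from dividing (6.13) and (6.10) through: the $c_i$ cancels, leaving $n_{ij}/N$.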

In machine learning, we use discrete probability distributions to model categorical variables, i.e., variables that take a finite set of unordered values. They could be categorical features, such as the degree taken at university when used for predicting the salary of a person, or categorical labels, such as letters of the alphabet when doing handwriting recognition. Discrete distributions are also often used to construct probabilistic models that combine a finite number of continuous distributions.





In [1]:
# --- 1. Define the states for X and Y ---
x_states = ['x1', 'x2', 'x3']
y_states = ['y1', 'y2']

# --- 2. Create a synthetic frequency table (nij) ---
# This represents the counts of events for each (X=xi, Y=yj) pair.
# The structure will be nij[i][j] where i corresponds to x_states index and j to y_states index.
# Let's make up some numbers for illustration:
#        y1  y2
#   x1 [ 10, 20 ]
#   x2 [ 15,  5 ]
#   x3 [ 25, 30 ]

nij = [
    [10, 20],  # Counts for x1: (x1,y1)=10, (x1,y2)=20
    [15, 5],   # Counts for x2: (x2,y1)=15, (x2,y2)=5
    [25, 30]   # Counts for x3: (x3,y1)=25, (x3,y2)=30
]

print("--- 1. Synthetic Frequency Table (nij) ---")
print("   " + " ".join([f"{y:<4}" for y in y_states]))
print("   " + "-" * (len(y_states) * 5 - 1))
for i, x_state in enumerate(x_states):
    row_str = f"{x_state} | "
    for count in nij[i]:
        row_str += f"{count:<4} "
    print(row_str)
print("-" * 40)

# --- 3. Calculate Total Number of Events (N) ---
N = sum(sum(row) for row in nij)
print(f"Total number of events (N): {N}")
print("-" * 40)

# --- 4. Calculate Joint Probability p(x, y) ---
# p(x, y) = nij / N
joint_prob_table = [[0.0 for _ in y_states] for _ in x_states]

print("--- 4. Joint Probability Table p(x, y) ---")
print("   " + " ".join([f"{y:<6}" for y in y_states]))
print("   " + "-" * (len(y_states) * 7 - 1))
for i in range(len(x_states)):
    row_str = f"{x_states[i]} | "
    for j in range(len(y_states)):
        joint_prob_table[i][j] = nij[i][j] / N
        row_str += f"{joint_prob_table[i][j]:<6.3f} "
    print(row_str)
print("-" * 40)

# --- 5. Calculate Marginal Probabilities ---

# Marginal probability p(x_i) = c_i / N, see (6.10).
# Layout note: the text calls c_i a "column sum" and r_j a "row sum".
# In this cell nij[i][j] stores the X states along the rows of the nested list, so:
#   c_i = sum_j nij[i][j]  (counts with X = x_i) is a ROW sum of nij,
#   r_j = sum_i nij[i][j]  (counts with Y = y_j) is a COLUMN sum of nij.

# Calculate c_i (sum of nij for a fixed X_i - row sums of nij)
ci_sums = [sum(row) for row in nij]
print("--- 5.1. Marginal Probabilities p(x) ---")
print("Row sums of nij (c_i):", ci_sums)

marginal_px = {}
for i, x_state in enumerate(x_states):
    marginal_px[x_state] = ci_sums[i] / N
    print(f"P(X = {x_state}) = {marginal_px[x_state]:.3f}")

# Verify sum to 1
sum_px = sum(marginal_px.values())
print(f"Sum P(X) = {sum_px:.3f} (should be 1)")
print("-" * 40)

# Calculate r_j (sum of nij for a fixed Y_j - column sums of nij)
rj_sums = [0] * len(y_states)
for i in range(len(x_states)):
    for j in range(len(y_states)):
        rj_sums[j] += nij[i][j]
print("--- 5.2. Marginal Probabilities p(y) ---")
print("Column sums of nij (r_j):", rj_sums)

marginal_py = {}
for j, y_state in enumerate(y_states):
    marginal_py[y_state] = rj_sums[j] / N
    print(f"P(Y = {y_state}) = {marginal_py[y_state]:.3f}")

# Verify sum to 1
sum_py = sum(marginal_py.values())
print(f"Sum P(Y) = {sum_py:.3f} (should be 1)")
print("-" * 40)

# --- 6. Calculate Conditional Probabilities ---

# Conditional Probability P(Y = yj | X = xi) = nij / c_i
print("--- 6.1. Conditional Probability P(Y | X) ---")
conditional_py_given_x = {}
for i, x_state in enumerate(x_states):
    if ci_sums[i] == 0: # Avoid division by zero if a row sum is 0
        print(f"Cannot calculate P(Y | X = {x_state}) as P(X = {x_state}) is 0.")
        continue
    for j, y_state in enumerate(y_states):
        prob = nij[i][j] / ci_sums[i]
        conditional_py_given_x[(y_state, x_state)] = prob
        print(f"P(Y = {y_state} | X = {x_state}) = {prob:.3f}")
    # Verify sum for each condition
    current_sum = sum(conditional_py_given_x[(y_state, x_state)] for y_state in y_states)
    print(f"  Sum for X={x_state}: {current_sum:.3f} (should be 1)")
print("-" * 40)


# Conditional Probability P(X = xi | Y = yj) = nij / r_j
print("--- 6.2. Conditional Probability P(X | Y) ---")
conditional_px_given_y = {}
for j, y_state in enumerate(y_states):
    if rj_sums[j] == 0: # Avoid division by zero if a column sum is 0
        print(f"Cannot calculate P(X | Y = {y_state}) as P(Y = {y_state}) is 0.")
        continue
    for i, x_state in enumerate(x_states):
        prob = nij[i][j] / rj_sums[j]
        conditional_px_given_y[(x_state, y_state)] = prob
        print(f"P(X = {x_state} | Y = {y_state}) = {prob:.3f}")
    # Verify sum for each condition
    current_sum = sum(conditional_px_given_y[(x_state, y_state)] for x_state in x_states)
    print(f"  Sum for Y={y_state}: {current_sum:.3f} (should be 1)")
print("-" * 40)

--- 1. Synthetic Frequency Table (nij) ---
   y1   y2  
   ---------
x1 | 10   20   
x2 | 15   5    
x3 | 25   30   
----------------------------------------
Total number of events (N): 105
----------------------------------------
--- 4. Joint Probability Table p(x, y) ---
   y1     y2    
   -------------
x1 | 0.095  0.190  
x2 | 0.143  0.048  
x3 | 0.238  0.286  
----------------------------------------
--- 5.1. Marginal Probabilities p(x) ---
Row sums of nij (c_i): [30, 20, 55]
P(X = x1) = 0.286
P(X = x2) = 0.190
P(X = x3) = 0.524
Sum P(X) = 1.000 (should be 1)
----------------------------------------
--- 5.2. Marginal Probabilities p(y) ---
Column sums of nij (r_j): [50, 55]
P(Y = y1) = 0.476
P(Y = y2) = 0.524
Sum P(Y) = 1.000 (should be 1)
----------------------------------------
--- 6.1. Conditional Probability P(Y | X) ---
P(Y = y1 | X = x1) = 0.333
P(Y = y2 | X = x1) = 0.667
  Sum for X=x1: 1.000 (should be 1)
P(Y = y1 | X = x2) = 0.750
P(Y = y2 | X = x2) = 0.250
  Sum for X=x2

### 6.2.2 Continuous Probabilities

We consider real-valued random variables in this section, i.e., we consider target spaces that are intervals of the real line $\mathbb{R}$. In this book, we pretend that we can perform operations on real random variables as if we have discrete probability spaces with finite states. However, this simplification is not precise for two situations: when we repeat something infinitely often, and when we want to draw a point from an interval. The first situation arises when we discuss generalization errors in machine learning (Chapter 8). The second situation arises when we want to discuss continuous distributions, such as the Gaussian (Section 6.5). For our purposes, the lack of precision allows for a briefer introduction to probability.

**Remark.** In continuous spaces, there are two additional technicalities, which are counterintuitive. First, the set of all subsets (used to define the event space $\mathcal{A}$ in Section 6.1) is not well behaved enough. $\mathcal{A}$ needs to be restricted to behave well under set complements, set intersections, and set unions. Second, the size of a set (which in discrete spaces can be obtained by counting the elements) turns out to be tricky. The size of a set is called its **measure**. For example, the cardinality of discrete sets, the length of an interval in $\mathbb{R}$, and the volume of a region in $\mathbb{R}^d$ are all measures. Sets that behave well under set operations and additionally have a topology are called a **Borel $\sigma$-algebra**. Betancourt details a careful construction of probability spaces from set theory without being bogged down in technicalities; see https://tinyurl.com/yb3t6mfd. For a more precise construction, we refer to Billingsley (1995) and Jacod and Protter (2004).

In this book, we consider real-valued random variables with their corresponding Borel $\sigma$-algebra. We consider random variables with values in $\mathbb{R}^D$ to be a vector of real-valued random variables. $\diamondsuit$

---

**Definition 6.1 (Probability Density Function).**
A function $f : \mathbb{R}^D \to \mathbb{R}$ is called a **probability density function (pdf)** if:
1.  $\forall \mathbf{x} \in \mathbb{R}^D : f(\mathbf{x}) \geq 0$
2.  Its integral exists and
    $$\int_{\mathbb{R}^D} f(\mathbf{x}) d\mathbf{x} = 1 \quad \text{(6.15)}$$

For probability mass functions (pmf) of discrete random variables, the integral in (6.15) is replaced with a sum (6.12). Observe that the probability density function is any function $f$ that is non-negative and integrates to one.
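
Both conditions of Definition 6.1 can be spot-checked numerically. The sketch below is only illustrative: the three candidate functions and the helper names (`integrate_midpoint`, `looks_like_pdf`) are assumptions introduced here, the candidates are taken to be zero outside $[0, 1]$, and a grid check can suggest but not prove non-negativity.

```python
def integrate_midpoint(f, lo, hi, num_cells=10_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    width = (hi - lo) / num_cells
    return sum(f(lo + (k + 0.5) * width) for k in range(num_cells)) * width


def looks_like_pdf(f, lo, hi, tol=1e-6, num_cells=10_000):
    """Spot-check Definition 6.1 on [lo, hi]; f is assumed to vanish elsewhere.

    Returns (passes_both_checks, approximate_integral).
    """
    width = (hi - lo) / num_cells
    nonnegative = all(f(lo + (k + 0.5) * width) >= 0.0 for k in range(num_cells))
    total = integrate_midpoint(f, lo, hi, num_cells)
    return (nonnegative and abs(total - 1.0) < tol), total


# Hypothetical candidates on [0, 1]
def triangular(x):
    return 2.0 * x          # non-negative and integrates to 1


def too_small(x):
    return x                # non-negative but integrates to 1/2


def goes_negative(x):
    return 3.0 * x - 0.5    # integrates to 1 but is negative for x < 1/6


for candidate in (triangular, too_small, goes_negative):
    ok, total = looks_like_pdf(candidate, 0.0, 1.0)
    print(f"{candidate.__name__}: integral ~ {total:.4f}, passes both checks: {ok}")
```

The third candidate shows why both conditions are needed: integrating to one is not enough if the function dips below zero anywhere.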

We associate a random variable $X$ with this function $f$ by:

$$P(a \leq X \leq b) = \int_a^b f(x) dx \quad \text{(6.16)}$$

where $a, b \in \mathbb{R}$ and $x \in \mathbb{R}$ are outcomes of the continuous random variable $X$. States $\mathbf{x} \in \mathbb{R}^D$ are defined analogously by considering a vector of scalars $x \in \mathbb{R}$. This association (6.16) is called the **law** or **distribution** of the random variable $X$.

**Remark.** In contrast to discrete random variables, the probability of a continuous random variable $X$ taking a particular value, $P(X = x)$, is zero. This is like trying to specify an interval in (6.16) where $a = b$: the event $\{X = x\}$ is a set of measure zero. $\diamondsuit$
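
One way to see this numerically is to shrink an interval around a point and watch its probability vanish. The sketch below assumes a standard Gaussian for $X$ (a choice made here only for illustration) and uses its closed-form cdf via `math.erf`; the point `x = 0.5` and the list of half-widths are likewise arbitrary.

```python
import math


def standard_normal_cdf(x):
    """Cdf of a standard Gaussian, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


x = 0.5
half_widths = [10.0 ** (-k) for k in range(1, 7)]  # 0.1, 0.01, ..., 1e-6

# P(x - eps <= X <= x + eps) as a difference of cdf values
interval_probs = [
    standard_normal_cdf(x + eps) - standard_normal_cdf(x - eps)
    for eps in half_widths
]

for eps, prob in zip(half_widths, interval_probs):
    print(f"P({x} - {eps:g} <= X <= {x} + {eps:g}) = {prob:.3e}")
```

Each tenfold reduction of the half-width reduces the probability by roughly a factor of ten, so the limit as the interval collapses to a single point is zero, even though the density at that point is strictly positive.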

---

**Definition 6.2 (Cumulative Distribution Function).**
A **cumulative distribution function (cdf)** of a multivariate real-valued random variable $X$ with states $\mathbf{x} \in \mathbb{R}^D$ is given by:

$$F_X(\mathbf{x}) = P(X_1 \leq x_1, \ldots, X_D \leq x_D) \quad \text{(6.17)}$$

where $X = [X_1, \ldots, X_D]^\top$, $\mathbf{x} = [x_1, \ldots, x_D]^\top$, and the right-hand side represents the probability that random variable $X_i$ takes the value smaller than or equal to $x_i$.

The cdf can be expressed also as the integral of the probability density function $f(\mathbf{x})$ so that:

$$F_X(\mathbf{x}) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_D} f(z_1, \ldots, z_D) dz_1 \cdots dz_D \quad \text{(6.18)}$$
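
In one dimension, (6.18) can be checked by integrating a density numerically and comparing against a known closed-form cdf. The sketch below assumes an exponential density with a hypothetical rate `RATE = 2.0`; neither the distribution nor the function names come from the text.

```python
import math

RATE = 2.0  # hypothetical rate parameter, chosen only for illustration


def pdf(z):
    """Exponential density f(z) = RATE * exp(-RATE * z) for z >= 0, else 0."""
    return RATE * math.exp(-RATE * z) if z >= 0.0 else 0.0


def cdf_closed_form(x):
    """Closed-form cdf F(x) = 1 - exp(-RATE * x) for x >= 0, else 0."""
    return 1.0 - math.exp(-RATE * x) if x >= 0.0 else 0.0


def cdf_by_integration(x, num_cells=20_000):
    """Midpoint-rule approximation of the integral of pdf from -inf to x.

    The density is zero for z < 0, so the integration starts at 0.
    """
    if x <= 0.0:
        return 0.0
    width = x / num_cells
    return sum(pdf((k + 0.5) * width) for k in range(num_cells)) * width


for x in (0.1, 0.5, 1.0, 3.0):
    print(
        f"x = {x}: integral = {cdf_by_integration(x):.6f}, "
        f"closed form = {cdf_closed_form(x):.6f}"
    )
```

The two columns should agree to within the accuracy of the midpoint rule.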

There are cdfs which do not have corresponding pdfs.

**Remark.** We reiterate that there are in fact two distinct concepts when talking about distributions. First is the idea of a pdf (denoted by $f(x)$), which is a nonnegative function that integrates to one. Second is the law of a random variable $X$, that is, the association of a random variable $X$ with the pdf $f(x)$.

![image.png](attachment:image.png)

Fig.3 Examples of (a) discrete and (b) continuous uniform distributions. See Example 6.3 for details of the distributions.



In [2]:
# --- 1. Define a Probability Density Function (PDF) ---

class UniformPDF:
    def __init__(self, lower_bound, upper_bound):
        if lower_bound >= upper_bound:
            raise ValueError("Lower bound must be less than upper bound.")
        self.a = lower_bound
        self.b = upper_bound
        self.density = 1.0 / (self.b - self.a)

    def __call__(self, x):
        """
        Evaluates the PDF at a given point x.
        f(x) = 1 / (b-a) for a <= x <= b, else 0
        """
        if self.a <= x <= self.b:
            return self.density
        else:
            return 0.0

    def check_normalization(self, num_samples=100000):
        """
        Numerically approximates the integral of the PDF to check that it integrates to 1.
        Uses a simple left Riemann sum.
        """
        # Define the range for integration, slightly wider than [a, b] for robustness
        integral_range_start = self.a - (self.b - self.a) * 0.1
        integral_range_end = self.b + (self.b - self.a) * 0.1
        
        step_size = (integral_range_end - integral_range_start) / num_samples
        approx_integral = 0.0
        
        for i in range(num_samples):
            x = integral_range_start + i * step_size
            approx_integral += self(x) * step_size
            
        return approx_integral

# --- 2. Define a Cumulative Distribution Function (CDF) ---

class UniformCDF:
    def __init__(self, lower_bound, upper_bound):
        if lower_bound >= upper_bound:
            raise ValueError("Lower bound must be less than upper bound.")
        self.a = lower_bound
        self.b = upper_bound

    def __call__(self, x):
        """
        Evaluates the CDF at a given point x.
        F(x) = 0 for x < a
        F(x) = (x-a) / (b-a) for a <= x <= b
        F(x) = 1 for x > b
        """
        if x < self.a:
            return 0.0
        elif x > self.b:
            return 1.0
        else:
            return (x - self.a) / (self.b - self.a)

# --- 3. Demonstrating P(a <= X <= b) using PDF and CDF ---

# Let's set up a Uniform Distribution from 0 to 1
my_pdf = UniformPDF(lower_bound=0, upper_bound=1)
my_cdf = UniformCDF(lower_bound=0, upper_bound=1)

print("--- Continuous Probability Demonstration (Uniform Distribution) ---")
print(f"Distribution: Uniform from {my_pdf.a} to {my_pdf.b}")
print(f"PDF value f(x) = {my_pdf.density} for x in [{my_pdf.a}, {my_pdf.b}]")
print("-" * 60)

# Check PDF normalization (conceptual integral)
print("1. Checking PDF Normalization (Integral f(x)dx = 1):")
num_integration_points = 100000
approx_integral = my_pdf.check_normalization(num_integration_points)
print(f"Approximate integral of PDF: {approx_integral:.4f}")
if abs(approx_integral - 1.0) < 0.01: # Use a small tolerance for numerical approximation
    print("PDF approximately integrates to 1 (passes check).")
else:
    print("Warning: PDF does not integrate to 1 properly.")
print("-" * 60)

# Calculate P(a <= X <= b) using the PDF integral (approximation)
print("2. Calculating P(a <= X <= b) using PDF (Numerical Integration):")
interval_start = 0.2
interval_end = 0.7
num_rectangles = 10000
step_size_interval = (interval_end - interval_start) / num_rectangles
prob_from_pdf_integral = 0.0

for i in range(num_rectangles):
    midpoint_x = interval_start + (i + 0.5) * step_size_interval
    prob_from_pdf_integral += my_pdf(midpoint_x) * step_size_interval

print(f"P({interval_start} <= X <= {interval_end}) (via PDF integral): {prob_from_pdf_integral:.4f}")
# Analytical solution for Uniform(0,1): (0.7 - 0.2) * 1 = 0.5

print("-" * 60)

# Calculate P(a <= X <= b) using the CDF
print("3. Calculating P(a <= X <= b) using CDF:")
prob_from_cdf = my_cdf(interval_end) - my_cdf(interval_start)
print(f"P({interval_start} <= X <= {interval_end}) (via CDF): {prob_from_cdf:.4f}")

print("-" * 60)

# Demonstrate P(X = x) for continuous variable (should be 0)
print("4. Demonstrating P(X = x) for a continuous variable:")
point_x = 0.5
# Analytically, P(X = x) for any single point x is 0 for continuous distributions.
# Our PDF function would return a non-zero value, but this is a density, not a probability.
# The 'probability' of a single point is conceptualized as an integral from x to x, which is 0.
print(f"P(X = {point_x}) for a continuous variable is 0.")
print(f"The PDF value f({point_x}) = {my_pdf(point_x)}. This is a density, not a probability.")
print("-" * 60)

# --- 4. Multivariate CDF (Conceptual) ---
# For a multivariate uniform distribution (e.g., in 2D)
class MultiVariateUniformCDF:
    def __init__(self, lower_bounds, upper_bounds):
        if len(lower_bounds) != len(upper_bounds):
            raise ValueError("Lower and upper bounds must have the same dimension.")
        self.lower_bounds = lower_bounds
        self.upper_bounds = upper_bounds
        self.dimension = len(lower_bounds)
        
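        # Calculate the volume of the support, rejecting any dimension whose
        # width is not strictly positive (a zero or negative width would make
        # the per-dimension ratio in __call__ undefined or wrong).
        self.volume = 1.0
        for i in range(self.dimension):
            if lower_bounds[i] >= upper_bounds[i]:
                raise ValueError("Each lower bound must be less than its upper bound.")
            self.volume *= (upper_bounds[i] - lower_bounds[i])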

    def __call__(self, x_vector):
        """
        Evaluates the CDF at a given vector point x = [x1, ..., xD].
        For a uniform distribution over an axis-aligned box, this is the product of the per-dimension CDFs.
        FX(x) = P(X1 <= x1, ..., XD <= xD)
        """
        if len(x_vector) != self.dimension:
            raise ValueError(f"Input vector dimension ({len(x_vector)}) must match distribution dimension ({self.dimension}).")
            
        cumulative_prob = 1.0
        for i in range(self.dimension):
            xi = x_vector[i]
            ai = self.lower_bounds[i]
            bi = self.upper_bounds[i]
            
            if xi < ai:
                cumulative_prob *= 0.0 # Any dimension below its lower bound makes the total 0
            elif xi > bi:
                cumulative_prob *= 1.0 # Any dimension above its upper bound is fully covered
            else:
                cumulative_prob *= (xi - ai) / (bi - ai)
                
        return cumulative_prob

print("5. Demonstrating Multivariate CDF (Conceptual Uniform in 2D):")
# Example for a 2D uniform distribution over [0,1] x [0,1]
lower_bounds_2d = [0, 0]
upper_bounds_2d = [1, 1]
my_multivar_cdf = MultiVariateUniformCDF(lower_bounds_2d, upper_bounds_2d)

test_point_2d = [0.5, 0.5]
cdf_val_2d = my_multivar_cdf(test_point_2d)
print(f"F_X({test_point_2d}) = P(X1 <= {test_point_2d[0]}, X2 <= {test_point_2d[1]}) = {cdf_val_2d:.4f}")
# Expected for uniform [0,1]x[0,1]: (0.5-0)/(1-0) * (0.5-0)/(1-0) = 0.5 * 0.5 = 0.25

test_point_2d_outside = [1.5, 0.5]
cdf_val_2d_outside = my_multivar_cdf(test_point_2d_outside)
print(f"F_X({test_point_2d_outside}) = {cdf_val_2d_outside:.4f}") # Expected: 1.0 * 0.5 = 0.5

test_point_2d_all_outside = [1.5, 1.5]
cdf_val_2d_all_outside = my_multivar_cdf(test_point_2d_all_outside)
print(f"F_X({test_point_2d_all_outside}) = {cdf_val_2d_all_outside:.4f}") # Expected: 1.0 * 1.0 = 1.0

print("-" * 60)
print("\n--- Summary of Implementation ---")
print("This core Python implementation provides a conceptual understanding of PDFs and CDFs for continuous distributions.")
print("It uses a simple Uniform Distribution for clarity.")
print("Numerical integration is used to approximate integrals, highlighting the conceptual definition.")
print("For practical applications involving complex PDFs or accurate integration, libraries like NumPy and SciPy are indispensable.")

--- Continuous Probability Demonstration (Uniform Distribution) ---
Distribution: Uniform from 0 to 1
PDF value f(x) = 1.0 for x in [0, 1]
------------------------------------------------------------
1. Checking PDF Normalization (Integral f(x)dx = 1):
Approximate integral of PDF: 1.0000
PDF approximately integrates to 1 (passes check).
------------------------------------------------------------
2. Calculating P(a <= X <= b) using PDF (Numerical Integration):
P(0.2 <= X <= 0.7) (via PDF integral): 0.5000
------------------------------------------------------------
3. Calculating P(a <= X <= b) using CDF:
P(0.2 <= X <= 0.7) (via CDF): 0.5000
------------------------------------------------------------
4. Demonstrating P(X = x) for a continuous variable:
P(X = 0.5) for a continuous variable is 0.
The PDF value f(0.5) = 1.0. This is a density, not a probability.
------------------------------------------------------------
5. Demonstrating Multivariate CDF (Conceptual Uniform in 2D):
F_X(

### 6.2.3 Contrasting Discrete and Continuous Distributions

Recall from Section 6.1.2 that probabilities are nonnegative and the total probability sums up to one. For discrete random variables (see (6.12)), this implies that the probability of each state must lie in the interval $[0, 1]$. However, for continuous random variables the normalization (see (6.15)) does not imply that the value of the density is less than or equal to 1 for all values. We illustrate this in Figure 6.3 using the uniform distribution for both discrete and continuous random variables.

### Example 3

We consider two examples of the uniform distribution, where each state is equally likely to occur. This example illustrates some differences between discrete and continuous probability distributions.

Let $Z$ be a discrete uniform random variable with three states $\{z = -1.1, z = 0.3, z = 1.5\}$. The probability mass function can be represented as a table of probability values:

| $z$           | -1.1 | 0.3 | 1.5 |
| :------------ | :--- | :-- | :-- |
| $P(Z = z)$    | 1/3  | 1/3 | 1/3 |

Alternatively, we can think of this as a graph (Figure 6.3(a)), where we use the fact that the states can be located on the x-axis, and the y-axis represents the probability of a particular state. The y-axis in Figure 6.3(a) is deliberately extended so that it is the same as in Figure 6.3(b).

Let $X$ be a continuous random variable taking values in the range $0.9 \le X \le 1.6$, as represented by Figure 6.3(b). Observe that the height of the density can be greater than 1. However, it needs to hold that:

$$\int_{0.9}^{1.6} p(x)dx = 1 \quad \text{(6.19)}$$
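
As a quick check, assuming the density in Figure 6.3(b) is constant on its support (the uniform case this example describes), its height and its integral work out as:

$$p(x) = \frac{1}{1.6 - 0.9} = \frac{10}{7} \approx 1.43 \quad \text{for } 0.9 \le x \le 1.6, \qquad \int_{0.9}^{1.6} \frac{10}{7}\,dx = \frac{10}{7}\,(1.6 - 0.9) = 1$$

The height exceeds 1 precisely because the support is shorter than 1; the area under the density is what is constrained, not its value.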

**Remark.** There is an additional subtlety with regards to discrete probability distributions. The states $z_1, \ldots, z_d$ do not in principle have any structure, i.e., there is usually no way to compare them, for example $z_1 = \text{red}, z_2 = \text{green}, z_3 = \text{blue}$. However, in many machine learning applications discrete states take numerical values, e.g., $z_1 = -1.1, z_2 = 0.3, z_3 = 1.5$, where we could say $z_1 < z_2 < z_3$. Discrete states that assume numerical values are particularly useful because we often consider expected values (Section 6.4.1) of random variables. $\diamondsuit$

Unfortunately, machine learning literature uses notation and nomenclature that hides the distinction between the sample space $\Omega$, the target space $\mathcal{T}$, and the random variable $X$. For a value $x$ of the set of possible outcomes of the random variable $X$, i.e., $x \in \mathcal{T}$, $p(x)$ denotes the probability that random variable $X$ has the outcome $x$. For discrete random variables, this is written as $P(X = x)$, which is known as the probability mass function. The pmf is often referred to as the “distribution”. For continuous variables, $p(x)$ is called the probability density function (often referred to as a density). To muddy things even further, the cumulative distribution function $P(X \le x)$ is often also referred to as the “distribution”.

We think of the outcome $x$ as the argument that results in the probability $p(x)$.

In this chapter, we will use the notation $X$ to refer to both univariate and multivariate random variables, and denote the states by $x$ and $\mathbf{x}$ respectively. We summarize the nomenclature in Table 6.1.

---

**Table 6.1: Nomenclature for Probability Distributions**

| Type | Point probability | Interval probability |
| :--- | :--- | :--- |
| **Discrete** | $P(X = x)$ (probability mass function) | Not applicable |
| **Continuous** | $p(x)$ (probability density function) | $P(X \le x)$ (cumulative distribution function) |

---

**Remark.** We will be using the expression “probability distribution” not only for discrete probability mass functions but also for continuous probability density functions, although this is technically incorrect. In line with most machine learning literature, we also rely on context to distinguish the different uses of the phrase probability distribution.

In [4]:
# --- 1. Discrete Uniform Distribution Class ---
class DiscreteUniformDistribution:
    def __init__(self, states):
        if not states:
            raise ValueError("States list cannot be empty for discrete distribution.")
        self.states = sorted(set(states))  # Ensure unique and sorted states
        self.num_states = len(self.states)
        self.probability_per_state = 1.0 / self.num_states

    def pmf(self, x):
        """
        Probability Mass Function (PMF) for a discrete uniform variable.
        P(Z = z)
        """
        if x in self.states:
            return self.probability_per_state
        else:
            return 0.0 # Probability is 0 for states not in the defined set

    def get_all_pmf_values(self):
        """Returns a dictionary of all states and their probabilities."""
        return {state: self.pmf(state) for state in self.states}

# --- 2. Continuous Uniform Distribution Class ---
class ContinuousUniformDistribution:
    def __init__(self, lower_bound, upper_bound):
        if lower_bound >= upper_bound:
            raise ValueError("Lower bound must be less than upper bound for continuous distribution.")
        self.a = lower_bound
        self.b = upper_bound
        self.density = 1.0 / (self.b - self.a)

    def pdf(self, x):
        """
        Probability Density Function (PDF) for a continuous uniform variable.
        p(x)
        """
        if self.a <= x <= self.b:
            return self.density
        else:
            return 0.0

    def cdf(self, x):
        """
        Cumulative Distribution Function (CDF) for a continuous uniform variable.
        P(X <= x)
        """
        if x < self.a:
            return 0.0
        elif x > self.b:
            return 1.0
        else:
            return (x - self.a) / (self.b - self.a)

    def prob_interval(self, start_interval, end_interval):
        """
        Calculates P(start_interval <= X <= end_interval) for a continuous variable.
        This is the integral of the PDF over the interval.
        """
        if start_interval >= end_interval:
            return 0.0
        
        # Adjust interval to be within the distribution's support
        actual_start = max(self.a, start_interval)
        actual_end = min(self.b, end_interval)

        if actual_start >= actual_end:
            return 0.0 # No overlap or invalid interval
            
        # For uniform distribution, integral is simply density * length of interval
        return self.density * (actual_end - actual_start)

# --- Demonstration and Contrasting Examples ---

print("--- Example 6.3: Contrasting Discrete and Continuous Uniform Distributions ---")
print("-" * 80)

# --- Discrete Uniform Distribution (Example 6.3a) ---
print("\n### Discrete Uniform Distribution (Z)")
discrete_states = [-1.1, 0.3, 1.5]
discrete_dist = DiscreteUniformDistribution(discrete_states)

print(f"States (Z): {discrete_dist.states}")
print(f"Number of states: {discrete_dist.num_states}")
print(f"Probability per state (1/{discrete_dist.num_states}): {discrete_dist.probability_per_state:.3f}")

print("\n--- Point Probabilities (P(Z = z)) ---")
for state in discrete_states:
    prob = discrete_dist.pmf(state)
    print(f"P(Z = {state}) = {prob:.3f}")

# Point probability for a state not in the distribution
print(f"P(Z = 10.0) = {discrete_dist.pmf(10.0):.3f} (for a state not in the set)")

# Illustrate total probability sums to 1
total_discrete_prob = sum(discrete_dist.get_all_pmf_values().values())
print(f"Sum of P(Z=z) for all states = {total_discrete_prob:.3f} (should be 1)")
print("-" * 80)


# --- Continuous Uniform Distribution (Example 6.3b) ---
print("\n### Continuous Uniform Distribution (X)")
continuous_lower = 0.9
continuous_upper = 1.6
continuous_dist = ContinuousUniformDistribution(continuous_lower, continuous_upper)

print(f"Range (X): [{continuous_dist.a}, {continuous_dist.b}]")
print(f"Density (1 / (b-a)): {continuous_dist.density:.3f}")

print("\n--- Point Densities (p(x)) ---")
# Pick some points within and outside the range
points_to_check_pdf = [0.5, 1.0, 1.5, 2.0]
for point in points_to_check_pdf:
    density_val = continuous_dist.pdf(point)
    print(f"p({point}) (density) = {density_val:.3f}")

# Illustrate that density can be > 1
# Let's create another continuous distribution with a smaller range
narrow_continuous_dist = ContinuousUniformDistribution(0, 0.5) # Range length 0.5, density = 1/0.5 = 2.0
print(f"\n--- Illustrating density > 1 ---")
print(f"For Uniform(0, 0.5): Density p(x) = {narrow_continuous_dist.pdf(0.2):.3f} (which is > 1)")
# Check normalization for the narrow distribution: for a constant density,
# the integral over [a, b] is exactly width * height (no approximation needed).
integral_narrow = (narrow_continuous_dist.b - narrow_continuous_dist.a) * narrow_continuous_dist.density
print(f"Integral for Uniform(0, 0.5): {integral_narrow:.3f} (should be 1)")


print("\n--- Interval Probabilities (P(a <= X <= b)) ---")
interval1_start, interval1_end = 1.0, 1.2
prob1 = continuous_dist.prob_interval(interval1_start, interval1_end)
print(f"P({interval1_start} <= X <= {interval1_end}) = {prob1:.3f}")

interval2_start, interval2_end = 0.5, 0.8 # Outside range
prob2 = continuous_dist.prob_interval(interval2_start, interval2_end)
print(f"P({interval2_start} <= X <= {interval2_end}) = {prob2:.3f}")

interval3_start, interval3_end = 0.8, 1.7 # Overlapping range
prob3 = continuous_dist.prob_interval(interval3_start, interval3_end)
print(f"P({interval3_start} <= X <= {interval3_end}) = {prob3:.3f}")

print("\n--- Point Probability for Continuous Variable ---")
# As per text, P(X = x) for continuous variable is 0
point_x_continuous = 1.0
print(f"P(X = {point_x_continuous}) for a continuous variable is 0.")
print(f"The value p({point_x_continuous}) = {continuous_dist.pdf(point_x_continuous):.3f} is a density, not a probability.")

print("-" * 80)

print("\n### Summary of Contrasts (as per Table 6.1):")
print("- **Discrete (Z):**")
print(f"  - Point Probability (P(Z={discrete_states[0]})): {discrete_dist.pmf(discrete_states[0]):.3f} (actual probability)")
print("  - Interval Probability: Not directly applicable as continuous interval.")
print("- **Continuous (X):**")
print(f"  - Probability Density Function (p(X={points_to_check_pdf[1]})): {continuous_dist.pdf(points_to_check_pdf[1]):.3f} (a density, can be > 1)")
print(f"  - Point Probability (P(X={points_to_check_pdf[1]})): 0 (for any single point)")
print(f"  - Interval Probability (P({interval1_start} <= X <= {interval1_end})): {prob1:.3f} (calculated by integrating density)")

--- Example 6.3: Contrasting Discrete and Continuous Uniform Distributions ---
--------------------------------------------------------------------------------

### Discrete Uniform Distribution (Z)
States (Z): [-1.1, 0.3, 1.5]
Number of states: 3
Probability per state (1/3): 0.333

--- Point Probabilities (P(Z = z)) ---
P(Z = -1.1) = 0.333
P(Z = 0.3) = 0.333
P(Z = 1.5) = 0.333
P(Z = 10.0) = 0.000 (for a state not in the set)
Sum of P(Z=z) for all states = 1.000 (should be 1)
--------------------------------------------------------------------------------

### Continuous Uniform Distribution (X)
Range (X): [0.9, 1.6]
Density (1 / (b-a)): 1.429

--- Point Densities (p(x)) ---
p(X = 0.5) (density) = 0.000
p(X = 1.0) (density) = 1.429
p(X = 1.5) (density) = 1.429
p(X = 2.0) (density) = 0.000

--- Illustrating density > 1 ---
For Uniform(0, 0.5): Density p(x) = 2.000 (which is > 1)
Approximate integral for Uniform(0, 0.5): 1.000 (should be 1)

--- Interval Probabilities (P(a <= X <= b)) --

# 6.3 Sum Rule, Product Rule, and Bayes’ Theorem

We think of probability theory as an extension to logical reasoning. As we discussed in Section 6.1.1, the rules of probability presented here follow naturally from fulfilling the desiderata (Jaynes, 2003, chapter 2). Probabilistic modeling (Section 8.4) provides a principled foundation for designing machine learning methods. Once we have defined probability distributions (Section 6.2) corresponding to the uncertainties of the data and our problem, it turns out that there are only two fundamental rules, the sum rule and the product rule.

Recall from (6.9) that $p(x, y)$ is the joint distribution of the two random variables $x, y$. The distributions $p(x)$ and $p(y)$ are the corresponding marginal distributions, and $p(y | x)$ is the conditional distribution of $y$ given $x$. Given the definitions of the marginal and conditional probability for discrete and continuous random variables in Section 6.2, we can now present the two fundamental rules in probability theory.

The first rule, the **sum rule**, states that:

$$ p(x) = \begin{cases} \sum_{y \in \mathcal{Y}} p(x, y) & \text{if } y \text{ is discrete} \\ \int_{\mathcal{Y}} p(x, y) dy & \text{if } y \text{ is continuous} \end{cases} \quad \text{(6.20)}$$

where $\mathcal{Y}$ are the states of the target space of random variable $Y$. This means that we sum out (or integrate out) the set of states $y$ of the random variable $Y$. The sum rule is also known as the **marginalization property**. The sum rule relates the joint distribution to a marginal distribution. In general, when the joint distribution contains more than two random variables, the sum rule can be applied to any subset of the random variables, resulting in a marginal distribution of potentially more than one random variable. More concretely, if $\mathbf{x} = [x_1, \ldots, x_D]^\top$, we obtain the marginal:

$$p(x_i) = \int p(x_1, \ldots, x_D) d\mathbf{x}_{\setminus i} \quad \text{(6.21)}$$

by repeated application of the sum rule, where we integrate or sum out all random variables except $x_i$. The subscript $\setminus i$ reads “all except $i$.”
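The repeated application of the sum rule in (6.21) can be sketched for discrete variables with a small joint table. The table `joint`, its weights, and the helper `marginal` below are illustrative assumptions and do not come from the text; the point is only that summing over every variable except one leaves the marginal of that variable.

```python
from itertools import product

# Hypothetical joint pmf p(x1, x2, x3) over three binary variables, stored as a dict.
# The weights are illustrative; they are normalized so the table sums to 1.
weights = {s: 1 + s[0] + 2 * s[1] + 3 * s[2] for s in product([0, 1], repeat=3)}
total = sum(weights.values())
joint = {s: w / total for s, w in weights.items()}

def marginal(joint, keep):
    """Sum out every variable except the one at index `keep`."""
    result = {}
    for states, prob in joint.items():
        result[states[keep]] = result.get(states[keep], 0.0) + prob
    return result

p_x2 = marginal(joint, keep=1)  # p(x2) = sum over x1 and x3 of p(x1, x2, x3)
print(p_x2)  # {0: 0.375, 1: 0.625}
```

For continuous variables the same idea applies with the sums replaced by integrals.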

**Remark.** Many of the computational challenges of probabilistic modeling are due to the application of the sum rule. When there are many variables or discrete variables with many states, the sum rule boils down to performing a high-dimensional sum or integral. Performing high-dimensional sums or integrals is generally computationally hard, in the sense that there is no known polynomial-time algorithm to calculate them exactly. $\diamondsuit$
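To get a feel for the remark above, the following sketch counts how many terms a brute-force application of the sum rule would need. The helper `num_terms`, the choice of $K = 10$ states per variable, and the listed values of $D$ are illustrative assumptions; the sketch only describes the naive enumeration, not what cleverer algorithms may achieve for structured models.

```python
# Number of terms in a brute-force marginal p(x_i) when each of D discrete
# variables has K states: all variables except x_i are summed out, so the
# sum has K**(D - 1) terms for each value of x_i.
def num_terms(num_states, num_variables):
    return num_states ** (num_variables - 1)

K = 10  # illustrative number of states per variable
for D in (2, 5, 10, 20):
    print(f"D = {D:2d}: {num_terms(K, D):,} terms per value of x_i")
```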

The second rule, known as the **product rule**, relates the joint distribution to the conditional distribution via:

$$p(x, y) = p(y | x)p(x) \quad \text{(6.22)}$$

The product rule can be interpreted as the fact that every joint distribution of two random variables can be factorized into (written as a product of) two other distributions. The two factors are the marginal distribution of the first random variable, $p(x)$, and the conditional distribution of the second random variable given the first, $p(y | x)$. Since the ordering of random variables is arbitrary in $p(x, y)$, the product rule also implies $p(x, y) = p(x | y)p(y)$. To be precise, (6.22) is expressed in terms of the probability mass functions for discrete random variables. For continuous random variables, the product rule is expressed in terms of the probability density functions (Section 6.2.3).
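A minimal check of (6.22) in both orderings, assuming a small hypothetical joint table (the values in `joint` are illustrative and not taken from the text): the conditionals are obtained by rearranging the product rule, and multiplying them back by the matching marginal should reproduce the joint.

```python
# Hypothetical joint pmf p(x, y) over x in {0, 1}, y in {0, 1} (illustrative values).
joint = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}
xs, ys = (0, 1), (0, 1)

# Marginals via the sum rule.
p_x = {x: sum(joint[(x, y)] for y in ys) for x in xs}
p_y = {y: sum(joint[(x, y)] for x in xs) for y in ys}

# Conditionals obtained by rearranging the product rule.
p_y_given_x = {(y, x): joint[(x, y)] / p_x[x] for x in xs for y in ys}
p_x_given_y = {(x, y): joint[(x, y)] / p_y[y] for x in xs for y in ys}

# Both factorizations recover the same joint (up to floating-point error).
for x in xs:
    for y in ys:
        via_x = p_y_given_x[(y, x)] * p_x[x]  # p(y | x) p(x)
        via_y = p_x_given_y[(x, y)] * p_y[y]  # p(x | y) p(y)
        assert abs(via_x - joint[(x, y)]) < 1e-12
        assert abs(via_y - joint[(x, y)]) < 1e-12
print("p(y | x) p(x) and p(x | y) p(y) both reproduce p(x, y)")
```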

In machine learning and Bayesian statistics, we are often interested in making inferences of unobserved (latent) random variables given that we have observed other random variables. Let us assume we have some prior knowledge $p(x)$ about an unobserved random variable $x$ and some relationship $p(y | x)$ between $x$ and a second random variable $y$, which we can observe. If we observe $y$, we can use Bayes’ theorem to draw some conclusions about $x$ given the observed values of $y$.

**Bayes’ theorem** (also Bayes’ rule or Bayes’ law):

$$p(x | y) = \frac{\overbrace{p(y | x)}^{\text{likelihood}} \overbrace{p(x)}^{\text{prior}}}{\underbrace{p(y)}_{\text{evidence}}} \quad \text{(6.23)}$$

is a direct consequence of the product rule in (6.22) since:

$$p(x, y) = p(x | y)p(y) \quad \text{(6.24)}$$

and

$$p(x, y) = p(y | x)p(x) \quad \text{(6.25)}$$

so that:

$$p(x | y)p(y) = p(y | x)p(x) \quad \iff \quad p(x | y) = \frac{p(y | x)p(x)}{p(y)} \quad \text{(6.26)}$$

In (6.23), $p(x)$ is the **prior**, which encapsulates our subjective prior knowledge of the unobserved (latent) variable $x$ before observing any data. We can choose any prior that makes sense to us, but it is critical to ensure that the prior has a nonzero pdf (or pmf) on all plausible $x$, even if they are very rare. The likelihood $p(y | x)$ describes how $x$ and $y$ are related, and in the case of discrete probability distributions, it is the probability of the data $y$ if we were to know the latent variable $x$. Note that the likelihood is not a distribution in $x$, but only in $y$. We call $p(y | x)$ either the “likelihood of $x$ (given $y$)” or the “probability of $y$ given $x$” but never the likelihood of $y$ (MacKay, 2003). The **posterior** $p(x | y)$ is the quantity of interest in Bayesian statistics because it expresses exactly what we are interested in, i.e., what we know about $x$ after having observed $y$.
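Two points from the paragraph above can be made concrete with a small sketch: the likelihood need not sum to one over $x$, and a state with zero prior mass keeps zero posterior mass no matter how well it explains the data. The state names, the likelihood values, and both priors below are hypothetical and chosen only for illustration.

```python
# Hypothetical likelihood values p(y_obs | x) for one fixed observation y_obs.
# All numbers are illustrative assumptions.
likelihood = {"a": 0.9, "b": 0.5, "c": 0.1}

def posterior(prior, likelihood):
    """Bayes' theorem for a discrete latent variable: p(x | y) = p(y | x) p(x) / p(y)."""
    unnormalized = {x: likelihood[x] * prior[x] for x in prior}
    evidence = sum(unnormalized.values())  # p(y) via the sum rule
    return {x: value / evidence for x, value in unnormalized.items()}

broad_prior = {"a": 0.2, "b": 0.4, "c": 0.4}
zero_prior = {"a": 0.0, "b": 0.5, "c": 0.5}  # rules out state "a" entirely

# The likelihood is a function of x, not a distribution in x: its sum exceeds 1 here.
print("sum over x of p(y | x):", sum(likelihood.values()))
print("posterior, broad prior:", posterior(broad_prior, likelihood))
print("posterior, zero prior: ", posterior(zero_prior, likelihood))
```

With `zero_prior`, state `"a"` has posterior probability zero even though it has the highest likelihood, which is why the prior should place nonzero mass on every plausible state.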

In [5]:
# --- 1. Define the Joint Probability Distribution P(X, Y) ---
# This is our starting point, representing p(x, y)
# The keys are tuples (weather, activity) and values are probabilities.
# These probabilities must sum to 1.0.

joint_prob_table = {
    ('Sunny', 'Hiking'): 0.30,
    ('Sunny', 'Shopping'): 0.10,
    ('Cloudy', 'Hiking'): 0.15,
    ('Cloudy', 'Shopping'): 0.25,
    ('Rainy', 'Hiking'): 0.05,
    ('Rainy', 'Shopping'): 0.15
}

# Extract unique states for X and Y for easier iteration
x_states = sorted({x for x, _ in joint_prob_table})
y_states = sorted({y for _, y in joint_prob_table})

print("--- 1. Joint Probability Distribution P(X, Y) ---")
print("Joint Probabilities:")
for (x, y), prob in joint_prob_table.items():
    print(f"P(X={x}, Y={y}) = {prob:.2f}")

total_joint_prob = sum(joint_prob_table.values())
print(f"Sum of joint probabilities: {total_joint_prob:.2f} (should be 1.0)")
if abs(total_joint_prob - 1.0) < 1e-9:
    print("Joint probabilities are normalized.")
else:
    print("Error: Joint probabilities do not sum to 1.0.")
print("-" * 50)


# --- 2. Implement the Sum Rule (Marginalization) ---
# p(x) = sum_y p(x, y)
# p(y) = sum_x p(x, y)

marginal_px = {}
for x in x_states:
    marginal_px[x] = sum(prob for (wx, wy), prob in joint_prob_table.items() if wx == x)

marginal_py = {}
for y in y_states:
    marginal_py[y] = sum(prob for (wx, wy), prob in joint_prob_table.items() if wy == y)

print("--- 2. Sum Rule (Marginal Probabilities) ---")
print("P(X) - Marginal Probability of Weather:")
for x, prob in marginal_px.items():
    print(f"P(X={x}) = {prob:.2f}")
print(f"Sum P(X): {sum(marginal_px.values()):.2f}")

print("\nP(Y) - Marginal Probability of Activity:")
for y, prob in marginal_py.items():
    print(f"P(Y={y}) = {prob:.2f}")
print(f"Sum P(Y): {sum(marginal_py.values()):.2f}")
print("-" * 50)


# --- 3. Implement the Product Rule ---
# p(x, y) = p(y | x) * p(x)  OR  p(x, y) = p(x | y) * p(y)
# From this, we can derive conditional probabilities:
# p(y | x) = p(x, y) / p(x)
# p(x | y) = p(x, y) / p(y)

conditional_py_given_x = {}
print("--- 3. Product Rule (Conditional Probabilities) ---")
print("P(Y | X) - Probability of Activity given Weather:")
for x in x_states:
    if marginal_px[x] == 0:
        print(f"Cannot compute P(Y | X={x}) as P(X={x}) is zero.")
        continue
    for y in y_states:
        joint_prob = joint_prob_table[(x, y)]
        cond_prob = joint_prob / marginal_px[x]
        conditional_py_given_x[(y, x)] = cond_prob
        print(f"P(Y={y} | X={x}) = {cond_prob:.2f}")
    # Verify sum for each condition
    current_sum = sum(conditional_py_given_x[(y, x)] for y in y_states)
    print(f"  Sum for X={x}: {current_sum:.2f} (should be 1.0)")

print("\nP(X | Y) - Probability of Weather given Activity:")
conditional_px_given_y = {}
for y in y_states:
    if marginal_py[y] == 0:
        print(f"Cannot compute P(X | Y={y}) as P(Y={y}) is zero.")
        continue
    for x in x_states:
        joint_prob = joint_prob_table[(x, y)]
        cond_prob = joint_prob / marginal_py[y]
        conditional_px_given_y[(x, y)] = cond_prob
        print(f"P(X={x} | Y={y}) = {cond_prob:.2f}")
    # Verify sum for each condition
    current_sum = sum(conditional_px_given_y[(x, y)] for x in x_states)
    print(f"  Sum for Y={y}: {current_sum:.2f} (should be 1.0)")
print("-" * 50)


# --- 4. Implement Bayes' Theorem ---
# P(X | Y) = [P(Y | X) * P(X)] / P(Y)
# Let's calculate P(X='Sunny' | Y='Shopping') using Bayes' Theorem
# We need:
#   P(Y='Shopping' | X='Sunny') (likelihood)
#   P(X='Sunny') (prior)
#   P(Y='Shopping') (evidence)

# Example: What's the probability it was Sunny given that someone went Shopping?
target_x = 'Sunny'
observed_y = 'Shopping'

# Get components for Bayes' Theorem
prior_px = marginal_px[target_x]
likelihood_py_given_x = conditional_py_given_x[(observed_y, target_x)]
evidence_py = marginal_py[observed_y]

print("--- 4. Bayes' Theorem ---")
print(f"Calculating P(X='{target_x}' | Y='{observed_y}') using Bayes' Theorem:")
print(f"  Prior P(X='{target_x}'): {prior_px:.2f}")
print(f"  Likelihood P(Y='{observed_y}' | X='{target_x}'): {likelihood_py_given_x:.2f}")
print(f"  Evidence P(Y='{observed_y}'): {evidence_py:.2f}")

if evidence_py == 0:
    print("Cannot apply Bayes' Theorem: Evidence P(Y) is zero.")
else:
    posterior_px_given_y_bayes = (likelihood_py_given_x * prior_px) / evidence_py
    print(f"  Posterior P(X='{target_x}' | Y='{observed_y}') = ({likelihood_py_given_x:.2f} * {prior_px:.2f}) / {evidence_py:.2f} = {posterior_px_given_y_bayes:.2f}")

    # Verify with direct calculation from product rule (should match)
    direct_px_given_y = conditional_px_given_y[(target_x, observed_y)]
    print(f"  (Directly calculated P(X='{target_x}' | Y='{observed_y}') = {direct_px_given_y:.2f})")

    if abs(posterior_px_given_y_bayes - direct_px_given_y) < 1e-9:
        print("  Bayes' Theorem calculation matches direct conditional probability (consistent).")
    else:
        print("  Error: Bayes' Theorem calculation does NOT match direct conditional probability.")
print("-" * 50)


# --- General Function for Bayes' Theorem ---
def bayes_theorem(prior_dist, likelihood_dist, evidence_dist, x_val, y_val):
    """
    Applies Bayes' Theorem for discrete variables.
    P(X=x_val | Y=y_val) = [P(Y=y_val | X=x_val) * P(X=x_val)] / P(Y=y_val)

    Args:
        prior_dist (dict): Dictionary of P(X) for all x states.
        likelihood_dist (dict): Dictionary of P(Y|X) for all (y,x) pairs.
                                Keys are (y_state, x_state).
        evidence_dist (dict): Dictionary of P(Y) for all y states.
        x_val (str): The specific x state for which to calculate the posterior.
        y_val (str): The specific y state that was observed.

    Returns:
        float or None: The posterior probability P(X=x_val | Y=y_val), or None
        if a required entry is missing or the evidence is zero.
    """
    prior = prior_dist.get(x_val)
    likelihood = likelihood_dist.get((y_val, x_val))
    evidence = evidence_dist.get(y_val)

    if prior is None:
        print(f"Error: Prior for X='{x_val}' not found.")
        return None
    if likelihood is None:
        print(f"Error: Likelihood for Y='{y_val}' given X='{x_val}' not found.")
        return None
    if evidence is None:
        print(f"Error: Evidence for Y='{y_val}' not found.")
        return None
    
    if evidence == 0:
        print(f"Error: Evidence P(Y='{y_val}') is zero, cannot apply Bayes' Theorem.")
        return None
    
    return (likelihood * prior) / evidence

print("\n--- Bayes' Theorem Function Demonstration ---")
# Let's try another example: P(X='Rainy' | Y='Hiking')
target_x_2 = 'Rainy'
observed_y_2 = 'Hiking'

posterior_2 = bayes_theorem(marginal_px, conditional_py_given_x, marginal_py, target_x_2, observed_y_2)

if posterior_2 is not None:
    print(f"P(X='{target_x_2}' | Y='{observed_y_2}') (using function) = {posterior_2:.2f}")
    # Verify
    direct_2 = conditional_px_given_y[(target_x_2, observed_y_2)]
    print(f"(Directly calculated P(X='{target_x_2}' | Y='{observed_y_2}') = {direct_2:.2f})")
    if abs(posterior_2 - direct_2) < 1e-9:
        print("Function result matches direct calculation.")
    else:
        print("Function result does NOT match direct calculation.")

--- 1. Joint Probability Distribution P(X, Y) ---
Joint Probabilities:
P(X=Sunny, Y=Hiking) = 0.30
P(X=Sunny, Y=Shopping) = 0.10
P(X=Cloudy, Y=Hiking) = 0.15
P(X=Cloudy, Y=Shopping) = 0.25
P(X=Rainy, Y=Hiking) = 0.05
P(X=Rainy, Y=Shopping) = 0.15
Sum of joint probabilities: 1.00 (should be 1.0)
Joint probabilities are normalized.
--------------------------------------------------
--- 2. Sum Rule (Marginal Probabilities) ---
P(X) - Marginal Probability of Weather:
P(X=Cloudy) = 0.40
P(X=Rainy) = 0.20
P(X=Sunny) = 0.40
Sum P(X): 1.00

P(Y) - Marginal Probability of Activity:
P(Y=Hiking) = 0.50
P(Y=Shopping) = 0.50
Sum P(Y): 1.00
--------------------------------------------------
--- 3. Product Rule (Conditional Probabilities) ---
P(Y | X) - Probability of Activity given Weather:
P(Y=Hiking | X=Cloudy) = 0.37
P(Y=Shopping | X=Cloudy) = 0.62
  Sum for X=Cloudy: 1.00 (should be 1.0)
P(Y=Hiking | X=Rainy) = 0.25
P(Y=Shopping | X=Rainy) = 0.75
  Sum for X=Rainy: 1.00 (should be 1.0)
P(Y=Hikin

## Sum Rule, Product Rule, and Bayes’ Theorem (Continued)

The quantity:

$$p(y) := \int p(y | x)p(x)dx = \mathbb{E}_X [p(y | x)] \quad \text{(6.27)}$$

is the **marginal likelihood** / **evidence**. The right-hand side of (6.27) uses the expectation operator which we define in Section 6.4.1. By definition, the marginal likelihood integrates the numerator of (6.23) with respect to the latent variable $x$. Therefore, the marginal likelihood is independent of $x$, and it ensures that the posterior $p(x | y)$ is normalized. The marginal likelihood can also be interpreted as the expected likelihood where we take the expectation with respect to the prior $p(x)$. Beyond normalization of the posterior, the marginal likelihood also plays an important role in Bayesian model selection, as we will discuss in Section 8.6. Due to the integration in (8.44), the evidence is often hard to compute.
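The reading of (6.27) as an expected likelihood under the prior suggests a simple Monte Carlo sketch: draw samples of $x$ from the prior and average $p(y | x)$. The Gaussian prior $x \sim \mathcal{N}(0, 1)$, the Gaussian likelihood $y | x \sim \mathcal{N}(x, 1)$, the observation `y_obs`, and the sample size are all assumptions made for illustration. This model is chosen because its evidence is known in closed form, $p(y) = \mathcal{N}(y; 0, 2)$, which gives something to compare against.

```python
import math
import random

def normal_pdf(value, mean, variance):
    """Density of a univariate Gaussian N(mean, variance) evaluated at `value`."""
    return math.exp(-0.5 * (value - mean) ** 2 / variance) / math.sqrt(2 * math.pi * variance)

random.seed(0)
y_obs = 1.0            # illustrative observation
num_samples = 100_000  # illustrative sample size

# Monte Carlo estimate of p(y) = E_X[p(y | x)] with x drawn from the prior N(0, 1).
samples = (random.gauss(0.0, 1.0) for _ in range(num_samples))
evidence_mc = sum(normal_pdf(y_obs, x, 1.0) for x in samples) / num_samples

# For this Gaussian model the integral is available in closed form: p(y) = N(y; 0, 2).
evidence_exact = normal_pdf(y_obs, 0.0, 2.0)

print(f"Monte Carlo estimate: {evidence_mc:.4f}")
print(f"Closed form:          {evidence_exact:.4f}")
```

In most models no closed form exists, and sampling from the prior can be a poor estimator when the likelihood is sharply peaked, which is one reason the evidence is often hard to compute.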

Bayes’ theorem (6.23) allows us to invert the relationship between $x$ and $y$ given by the likelihood. Therefore, Bayes’ theorem is sometimes called the **probabilistic inverse**. We will discuss Bayes’ theorem further in Section 8.4.

**Remark.** In Bayesian statistics, the posterior distribution is the quantity of interest as it encapsulates all available information from the prior and the data. Instead of carrying the posterior around, it is possible to focus on some statistic of the posterior, such as the maximum of the posterior, which we will discuss in Section 8.3. However, focusing on some statistic of the posterior leads to loss of information. If we think in a bigger context, then the posterior can be used within a decision-making system, and having the full posterior can be extremely useful and lead to decisions that are robust to disturbances. For example, in the context of model-based reinforcement learning, Deisenroth et al. (2015) show that using the full posterior distribution of plausible transition functions leads to very fast (data/sample efficient) learning, whereas focusing on the maximum of the posterior leads to consistent failures. Therefore, having the full posterior can be very useful for a downstream task. In Chapter 9, we will continue this discussion in the context of linear regression. $\diamondsuit$
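The loss of information mentioned in the remark can be illustrated with two hypothetical discrete posteriors that share the same maximum but differ in spread. The probability values and the helper names `map_estimate` and `variance` are illustrative assumptions, not part of the text.

```python
# Two hypothetical posteriors p(x | y) over the same states (illustrative values).
confident = {0: 0.05, 1: 0.90, 2: 0.05}
uncertain = {0: 0.30, 1: 0.40, 2: 0.30}

def map_estimate(posterior):
    """State with the highest posterior probability."""
    return max(posterior, key=posterior.get)

def variance(posterior):
    """Variance of a discrete distribution given as {state: probability}."""
    mean = sum(x * p for x, p in posterior.items())
    return sum((x - mean) ** 2 * p for x, p in posterior.items())

for name, post in (("confident", confident), ("uncertain", uncertain)):
    print(f"{name}: MAP = {map_estimate(post)}, variance = {variance(post):.2f}")
```

Both posteriors report the same maximum, so a downstream decision that sees only that point cannot distinguish them, whereas the full posterior reveals how much uncertainty remains.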

## Summary Statistics and Independence

We are often interested in summarizing sets of random variables and comparing pairs of random variables. A **statistic** of a random variable is a deterministic function of that random variable. The **summary statistics** of a distribution provide one useful view of how a random variable behaves, and as the name suggests, provide numbers that summarize and characterize the distribution. We describe the mean and the variance, two well-known summary statistics. Then we discuss two ways to compare a pair of random variables: first, how to say that two random variables are independent; and second, how to compute an inner product between them.

In [6]:
# --- Reuse the Joint Probability Distribution and Marginals from previous example ---

joint_prob_table = {
    ('Sunny', 'Hiking'): 0.30,
    ('Sunny', 'Shopping'): 0.10,
    ('Cloudy', 'Hiking'): 0.15,
    ('Cloudy', 'Shopping'): 0.25,
    ('Rainy', 'Hiking'): 0.05,
    ('Rainy', 'Shopping'): 0.15
}

x_states = sorted({x for x, _ in joint_prob_table})
y_states = sorted({y for _, y in joint_prob_table})

# Calculate marginals (Sum Rule) - these are needed for evidence and independence check
marginal_px = {}
for x in x_states:
    marginal_px[x] = sum(prob for (wx, wy), prob in joint_prob_table.items() if wx == x)

marginal_py = {}
for y in y_states:
    marginal_py[y] = sum(prob for (wx, wy), prob in joint_prob_table.items() if wy == y)

# Calculate conditional P(Y | X) - needed for Bayes' Theorem interpretation
conditional_py_given_x = {}
for x in x_states:
    if marginal_px[x] == 0:
        continue
    for y in y_states:
        joint_prob = joint_prob_table[(x, y)]
        cond_prob = joint_prob / marginal_px[x]
        conditional_py_given_x[(y, x)] = cond_prob


print("--- 1. Marginal Likelihood / Evidence (for Discrete Variables) ---")
# For discrete variables, p(y) = sum_x p(y|x)p(x)
# We already calculated P(Y) which is the evidence.
# Let's demonstrate it by manually calculating one for 'Shopping'.

observed_y_val = 'Shopping'
evidence_calc = 0.0
print(f"Calculating P(Y='{observed_y_val}') as Evidence using sum_x P(Y='{observed_y_val}'|X=x) * P(X=x):")
for x_state in x_states:
    prior_x = marginal_px[x_state]
    likelihood_y_given_x = conditional_py_given_x.get((observed_y_val, x_state), 0) # Use .get with default 0 in case of missing key
    term = likelihood_y_given_x * prior_x
    evidence_calc += term
    print(f"  P(Y='{observed_y_val}'|X='{x_state}') * P(X='{x_state}') = {likelihood_y_given_x:.2f} * {prior_x:.2f} = {term:.2f}")

print(f"Marginal Likelihood / Evidence P(Y='{observed_y_val}') = {evidence_calc:.2f}")
print(f"(Matches previously calculated marginal P(Y='{observed_y_val}') = {marginal_py.get(observed_y_val, 0):.2f})")
print("-" * 50)


# --- 2. Summary Statistics: Mean (Expected Value) ---
# For discrete random variable X with states x_i and probabilities p(x_i):
# E[X] = sum_i x_i * p(x_i)
# This example requires numerical states for X. Let's create a new discrete variable.

# Example: X = number of heads in 2 independent tosses of a biased coin
# with P(heads) = 0.3 (from Section 6.1.2)
# States: {0, 1, 2}
# Probabilities: P(X=0)=0.7^2=0.49, P(X=1)=2*0.3*0.7=0.42, P(X=2)=0.3^2=0.09 (from Example 6.1)
coin_toss_x_states = [0, 1, 2]
coin_toss_pmf = {0: 0.49, 1: 0.42, 2: 0.09}

def calculate_mean(states, pmf):
    """Calculates the mean (expected value) for a discrete random variable."""
    expected_value = 0.0
    for state in states:
        expected_value += state * pmf.get(state, 0)
    return expected_value

print("\n--- 2. Summary Statistics: Mean (Expected Value) ---")
mean_coin_toss = calculate_mean(coin_toss_x_states, coin_toss_pmf)
print(f"Random Variable X (Number of Heads): States={coin_toss_x_states}, PMF={coin_toss_pmf}")
print(f"Mean E[X] = {mean_coin_toss:.3f}")
print("-" * 50)


# --- 3. Summary Statistics: Variance ---
# Var(X) = E[(X - E[X])^2] = sum_i (x_i - E[X])^2 * p(x_i)

def calculate_variance(states, pmf, mean_val):
    """Calculates the variance for a discrete random variable."""
    variance = 0.0
    for state in states:
        variance += ((state - mean_val)**2) * pmf.get(state, 0)
    return variance

print("\n--- 3. Summary Statistics: Variance ---")
variance_coin_toss = calculate_variance(coin_toss_x_states, coin_toss_pmf, mean_coin_toss)
print(f"Variance Var(X) = {variance_coin_toss:.3f}")
print("-" * 50)


# --- 4. Independence ---
# Two random variables X and Y are independent if and only if:
# P(X=x, Y=y) = P(X=x) * P(Y=y) for all x and y.

print("\n--- 4. Independence Check ---")
print("Checking if Weather (X) and Activity (Y) are independent:")
are_independent = True
for x in x_states:
    for y in y_states:
        joint = joint_prob_table[(x, y)]
        product_marginals = marginal_px[x] * marginal_py[y]
        
        print(f"  P(X={x}, Y={y}) = {joint:.2f}")
        print(f"  P(X={x}) * P(Y={y}) = {marginal_px[x]:.2f} * {marginal_py[y]:.2f} = {product_marginals:.2f}")
        
        if abs(joint - product_marginals) > 1e-9: # Use a small tolerance for float comparison
            print(f"  --> Mismatch found for (X={x}, Y={y}). Joint({joint:.2f}) != Product Marginals({product_marginals:.2f})")
            are_independent = False
            # We can break here since we found a single case of non-independence
            break
    if not are_independent:
        break

if are_independent:
    print("\nConclusion: X and Y are independent.")
else:
    print("\nConclusion: X and Y are NOT independent.")
    print("This is expected given our sample data, as weather likely influences activity choices.")

print("-" * 50)

--- 1. Marginal Likelihood / Evidence (for Discrete Variables) ---
Calculating P(Y='Shopping') as Evidence using sum_x P(Y='Shopping'|X=x) * P(X=x):
  P(Y='Shopping'|X='Cloudy') * P(X='Cloudy') = 0.62 * 0.40 = 0.25
  P(Y='Shopping'|X='Rainy') * P(X='Rainy') = 0.75 * 0.20 = 0.15
  P(Y='Shopping'|X='Sunny') * P(X='Sunny') = 0.25 * 0.40 = 0.10
Marginal Likelihood / Evidence P(Y='Shopping') = 0.50
(Matches previously calculated marginal P(Y='Shopping') = 0.50)
--------------------------------------------------

--- 2. Summary Statistics: Mean (Expected Value) ---
Random Variable X (Number of Heads): States=[0, 1, 2], PMF={0: 0.49, 1: 0.42, 2: 0.09}
Mean E[X] = 0.600
--------------------------------------------------

--- 3. Summary Statistics: Variance ---
Variance Var(X) = 0.420
--------------------------------------------------

--- 4. Independence Check ---
Checking if Weather (X) and Activity (Y) are independent:
  P(X=Cloudy, Y=Hiking) = 0.15
  P(X=Cloudy) * P(Y=Hiking) = 0.40 * 0.50 