In [None]:
'''
 * Copyright (c) 2008 Radhamadhab Dalai
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 * THE SOFTWARE.
'''

##  Monotone Covariance and Rao-Blackwellization

In the spirit of the Duality Principle, Rao-Blackwellization exhibits an interesting difference between statistical perspectives and simulation practice, in the sense that the approximations used in the estimator do not (directly) involve the chain of interest. As shown in Section 4.2 and Section 7.6.2, conditioning on a subset of the simulated variables may produce considerable improvement upon the standard empirical estimator in terms of variance, by a simple "recycling" of the rejected variables (see also Section 3.3.3). Two-stage Gibbs sampling and its generalization of Chapter 10 do not permit this kind of recycling since every simulated value is accepted (Theorem 10.13). Nonetheless, Gelfand and Smith (1990) propose a type of conditioning christened Rao-Blackwellization in connection with the Rao-Blackwell Theorem (see Lehmann and Casella 1998, Section 1.7) and defined as parametric Rao-Blackwellization by Casella and Robert (1996) to differentiate from the form studied in Sections 4.2 and 7.6.2.

For $\mathbf{Y} = (Y_1, Y_2) \sim g(y_1, y_2)$, Rao-Blackwellization is based on the marginalization identity

$$
g^{(1)}(y_1) = \int g_1(y_1|y_2) g^{(2)}(y_2) dy_2.
$$

It thus replaces
$$
\delta_0 = \frac{1}{T} \sum_{t=1}^T h(y_1^{(t)})
$$
with
$$
\delta_{rb} = \frac{1}{T} \sum_{t=1}^T \mathbb{E}[h(Y_1)|y_2^{(t)}].
$$

Both estimators converge to $\mathbb{E}[h(Y_1)]$ and, under the stationary distribution, they are both unbiased. An application of the identity $\text{var}(U) = \text{var}(\mathbb{E}[U|V]) + \mathbb{E}[\text{var}(U|V)]$ implies that
$$
\text{var}(\mathbb{E}[h(Y_1)|Y_2]) \le \text{var}(h(Y_1)). \quad (9.9)
$$

This led Gelfand and Smith (1990) to suggest the use of $\delta_{rb}$ instead of $\delta_0$. However, inequality (9.9) does not by itself establish that $\delta_{rb}$ dominates $\delta_0$, since it ignores the correlation between the $Y^{(t)}$'s. The domination of $\delta_0$ by $\delta_{rb}$ can therefore be established in only a few cases; Liu et al. (1994) show in particular that it holds for the two-stage Gibbs sampler. (See also Geyer 1995 for necessary conditions.)
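To make the two estimators concrete, here is a minimal sketch that computes $\delta_0$ and $\delta_{rb}$ on a single run of a two-stage Gibbs sampler. The target (a bivariate normal with unit variances and correlation `rho`), the choice $h(y_1) = y_1$, and the function names are assumptions made for illustration; under them, $\mathbb{E}[Y_1|y_2] = \rho\, y_2$ is available in closed form.

```python
import random

def gibbs_bivariate_normal(T, rho, seed=0):
    """Two-stage Gibbs sampler for an assumed standard bivariate normal target."""
    rng = random.Random(seed)
    sd = (1 - rho ** 2) ** 0.5          # conditional standard deviation
    y1 = 0.0
    chain = []
    for _ in range(T):
        y2 = rng.gauss(rho * y1, sd)    # Y2 | y1 ~ N(rho * y1, 1 - rho^2)
        y1 = rng.gauss(rho * y2, sd)    # Y1 | y2 ~ N(rho * y2, 1 - rho^2)
        chain.append((y1, y2))
    return chain

def estimators(chain, rho):
    """Return (delta_0, delta_rb) for h(y1) = y1, using E[Y1 | y2] = rho * y2."""
    T = len(chain)
    delta_0 = sum(y1 for y1, _ in chain) / T
    delta_rb = sum(rho * y2 for _, y2 in chain) / T
    return delta_0, delta_rb

chain = gibbs_bivariate_normal(T=5000, rho=0.5)
delta_0, delta_rb = estimators(chain, rho=0.5)
print(f"delta_0 = {delta_0:.4f}, delta_rb = {delta_rb:.4f}")
```

Both values estimate $\mathbb{E}[Y_1] = 0$; comparing their variability requires repeating the run with different seeds, since a single run yields only one realization of each estimator.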

We establish the domination result in Theorem 9.19, but we first need some preliminary results, beginning with a representation lemma yielding the interesting result that covariances are positive in an interleaved chain.

**Lemma 9.17.** If $h \in \mathcal{L}_2(g_2)$ and if $(X^{(t)})$ is interleaved with $(Y^{(t)})$, then
$$
\text{cov}(h(Y^{(1)}), h(Y^{(2)})) = \text{var}(\mathbb{E}[h(Y)|X]).
$$

**Proof.** Assuming, without loss of generality, that $\mathbb{E}_{g_2}[h(Y)] = 0$,
$$
\begin{aligned}
\text{cov}(h(Y^{(1)}), h(Y^{(2)})) &= \mathbb{E}[h(Y^{(1)}) h(Y^{(2)})] \\
&= \mathbb{E} \left[ \mathbb{E}[h(Y^{(1)})|X^{(2)}] \mathbb{E}[h(Y^{(2)})|X^{(2)}] \right] \\
&= \mathbb{E} \left[ \mathbb{E}[h(Y^{(1)})|X^{(2)}]^2 \right] = \text{var}(\mathbb{E}[h(Y)|X]),
\end{aligned}
$$
where the second equality follows from iterating the expectation and using the conditional independence of the interleaving property. The last equality uses reversibility (that is, condition (iii)) of the interleaved chains. $\Box$
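As a quick sanity check of Lemma 9.17, consider (as an illustrative assumption) the two-stage Gibbs sampler on a bivariate normal with unit variances and correlation $\rho$, with $h(y) = y$. Writing the updates as $X^{(2)} = \rho Y^{(1)} + \varepsilon$ and $Y^{(2)} = \rho X^{(2)} + \varepsilon'$, where $\varepsilon$ and $\varepsilon'$ are independent centered noises, both sides of the lemma can be computed directly:

$$
\text{cov}(Y^{(1)}, Y^{(2)}) = \rho \, \text{cov}(Y^{(1)}, X^{(2)}) = \rho^2 \, \text{var}(Y^{(1)}) = \rho^2,
\qquad
\text{var}(\mathbb{E}[Y|X]) = \text{var}(\rho X) = \rho^2.
$$

The two quantities agree and are nonnegative, as the lemma requires.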

**Proposition 9.18.** If $(Y^{(t)})$ is a Markov chain with the interleaving property, the covariances
$$
\text{cov}(h(Y^{(1)}), h(Y^{(t)}))
$$
are positive and decreasing in $t$ for every $h \in \mathcal{L}_2(g_2)$.

**Proof.** Lemma 9.17 implies, by induction, that
$$
\begin{aligned}
\text{cov}(h(Y^{(1)}), h(Y^{(t)})) &= \mathbb{E} \left[ \mathbb{E}[h(Y)|X^{(2)}] \, \mathbb{E}[h(Y)|X^{(t)}] \right] \\
&= \text{var} \left( \mathbb{E} \left[ \cdots \mathbb{E} \left[ \mathbb{E}[h(Y)|X] \,|\, Y \right] \cdots \right] \right), \quad (9.10)
\end{aligned}
$$
where the last term involves $(t-1)$ conditional expectations, alternatively in $Y$ and in $X$. The decrease in $t$ directly follows from the inequality on conditional expectations, by virtue of the representation (9.10) and the inequality (9.9). $\Box$
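The monotone decrease can be observed empirically. The sketch below assumes, for illustration only, a bivariate normal target with unit variances and correlation `rho` and $h(y) = y$; in that case the lag-$k$ autocovariance of the $Y$ subchain is $\rho^{2k}$. The helper names are hypothetical.

```python
import random

def gibbs_y_chain(T, rho, seed=1):
    """Y subchain of a two-stage Gibbs sampler on an assumed bivariate normal."""
    rng = random.Random(seed)
    sd = (1 - rho ** 2) ** 0.5
    y = rng.gauss(0.0, 1.0)             # start from the stationary N(0, 1) marginal
    ys = []
    for _ in range(T):
        x = rng.gauss(rho * y, sd)      # X | y
        y = rng.gauss(rho * x, sd)      # Y | x
        ys.append(y)
    return ys

def autocovariance(values, lag):
    """Empirical covariance between values[i] and values[i + lag]."""
    n = len(values)
    mean = sum(values) / n
    pairs = range(n - lag)
    return sum((values[i] - mean) * (values[i + lag] - mean) for i in pairs) / (n - lag)

rho = 0.8
ys = gibbs_y_chain(T=20000, rho=rho)
lags = [0, 1, 2, 3]
empirical = [autocovariance(ys, k) for k in lags]
theoretical = [rho ** (2 * k) for k in lags]
for k, emp, theo in zip(lags, empirical, theoretical):
    print(f"lag {k}: empirical {emp:.3f}, theoretical {theo:.3f}")
```

Note that $\text{cov}(h(Y^{(1)}), h(Y^{(t)}))$ in the proposition corresponds to lag $t - 1$ here.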

The result on the improvement brought by Rao-Blackwellization then easily follows from Proposition 9.18.

**Theorem 9.19.** If $(X^{(t)})$ and $(Y^{(t)})$ are two interleaved Markov chains, with stationary distributions $f_X$ and $f_Y$ respectively, the estimator $\delta_{rb}$ dominates the estimator $\delta_0$ for every function $h$ with finite variance under both $f_X$ and $f_Y$.

**Proof.** Again assuming $\mathbb{E}[h(X)] = 0$, and introducing the estimators
$$
\text{(9.11)} \quad \delta_0 = \frac{1}{T} \sum_{t=1}^T h(X^{(t)}), \quad \delta_{rb} = \frac{1}{T} \sum_{t=1}^T \mathbb{E}[h(X)|Y^{(t)}],
$$
it follows that
$$
\begin{aligned}
\text{var}(\delta_0) &= \frac{1}{T^2} \sum_{t, t'} \text{cov}(h(X^{(t)}), h(X^{(t')})) \\
&= \frac{1}{T^2} \sum_{t, t'} \text{var} \left( \mathbb{E} \left[ \cdots \mathbb{E}[h(X)|Y] \cdots \right] \right) \quad (9.12)
\end{aligned}
$$
and
$$
\begin{aligned}
\text{var}(\delta_{rb}) &= \frac{1}{T^2} \sum_{t, t'} \text{cov}(\mathbb{E}[h(X)|Y^{(t)}], \mathbb{E}[h(X)|Y^{(t')}]) \\
&= \frac{1}{T^2} \sum_{t, t'} \text{var} \left( \mathbb{E} \left[ \cdots \mathbb{E}[\mathbb{E}[h(X)|Y]|X] \cdots \right] \right), \quad (9.13)
\end{aligned}
$$
according to the proof of Proposition 9.18, with $|t - t'|$ conditional expectations in the general term of (9.12) and $|t - t'| + 1$ in the general term of (9.13). It is then sufficient to compare $\text{var}(\delta_0)$ with $\text{var}(\delta_{rb})$ term by term to conclude that $\text{var}(\delta_0) \ge \text{var}(\delta_{rb})$. $\Box$

One might question whether Rao-Blackwellization will always result in an appreciable variance reduction, even as the sample size (or the number of Monte Carlo iterations) increases. This point was addressed by Levine (1996), who formulated this problem in terms of the *asymptotic relative efficiency* (ARE) of $\delta_0$ with respect to its Rao-Blackwellized version $\delta_{rb}$, given in (9.11), where the pairs $(X^{(t)}, Y^{(t)})$ are generated from a bivariate Gibbs sampler. The ARE is a ratio of the variances of the limiting distributions for the two estimators, which are given by
$$
\sigma_0^2 = \text{var}(h(X^{(0)})) + 2 \sum_{k=1}^{\infty} \text{cov}(h(X^{(0)}), h(X^{(k)})) \quad (9.14)
$$
and
$$
\sigma_{rb}^2 = \text{var}(\mathbb{E}[h(X)|Y]) + 2 \sum_{k=1}^{\infty} \text{cov}(\mathbb{E}[h(X)|Y^{(0)}], \mathbb{E}[h(X)|Y^{(k)}]). \quad (9.15)
$$

Levine (1996) established that the ratio $\sigma_0^2 / \sigma_{rb}^2 \ge 1$, with equality if and only if $\text{var}(h(X)) = \text{cov}(\mathbb{E}[h(X)|Y]) = 0$.

**Example 9.20.** (Continuation of Example 9.1) For the Gibbs sampler (9.4), it can be shown (Problem 9.5) that $\text{cov}(X^{(0)}, X^{(k)}) = \rho^{2k}$, for all $k$, and
$$
\sigma_0^2 / \sigma_{rb}^2 = \frac{1}{\rho^2} > 1.
$$

So, if $\rho$ is small, the amount of improvement, which is independent of the number of iterations, can be substantial.
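The ratio can be checked numerically by truncating the series in (9.14) and (9.15). The sketch below assumes $h(x) = x$ and a bivariate normal target with unit variances and correlation `rho`, so that $\mathbb{E}[X|Y] = \rho Y$ and the $Y$ subchain has the same autocovariances $\rho^{2k}$ as the $X$ subchain. Under these assumptions $\sigma_{rb}^2 = \rho^2 \sigma_0^2$, which gives a ratio of $1/\rho^2$. The function name and truncation length are illustrative choices.

```python
def asymptotic_variances(rho, num_terms=2000):
    """Truncated versions of (9.14) and (9.15) for h(x) = x (assumed setting)."""
    # Autocovariances of the X subchain: cov(X^(0), X^(k)) = rho^(2k).
    tail = sum(rho ** (2 * k) for k in range(1, num_terms + 1))
    sigma2_0 = 1.0 + 2.0 * tail
    # E[X | Y] = rho * Y, so every term of (9.15) is rho^2 times the matching term of (9.14).
    sigma2_rb = rho ** 2 * (1.0 + 2.0 * tail)
    return sigma2_0, sigma2_rb

for rho in (0.2, 0.5, 0.9):
    sigma2_0, sigma2_rb = asymptotic_variances(rho)
    ratio = sigma2_0 / sigma2_rb
    print(f"rho = {rho}: sigma_0^2 = {sigma2_0:.4f}, "
          f"sigma_rb^2 = {sigma2_rb:.4f}, ratio = {ratio:.4f}")
```

The gain is large when `rho` is close to 0 and vanishes as `rho` approaches 1, where conditioning on $Y$ removes almost no variability.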

In [None]:
import math
import random
import matplotlib.pyplot as plt

# Grid on which X is discretized: 101 points from -5 to 5
x_range = [-5 + i * 0.1 for i in range(101)]

# 1. Define conditional distributions (toy example; replace with your actual distributions)
def pi_x_given_y(x, y):
    """Unnormalized weights of X given Y on the grid x."""
    center = 0 if y == 0 else 3  # normal-like bump at 0 (y == 0) or at 3 (y == 1)
    return [0.1 * math.exp(-((xi - center) ** 2) / 2) for xi in x]

def f_y_given_x_y(y_new, x, y_old, rho):
    """Transition probability of Y: keep the current state with probability rho.

    In this toy model the move does not depend on x.
    """
    return rho if y_new == y_old else 1 - rho

# Cache the weights and the conditional means E[X | Y = y] for both states of Y
weights_by_y = {y: pi_x_given_y(x_range, y) for y in (0, 1)}
cond_mean_by_y = {
    y: sum(w * xi for w, xi in zip(weights_by_y[y], x_range)) / sum(weights_by_y[y])
    for y in (0, 1)
}

# 2. Dual chain sampler returning both trajectories
def dual_chain_sampler(num_samples, y_initial, rho):
    x_samples, y_samples = [], []
    y = y_initial
    for _ in range(num_samples):
        # Sample x given the current y (random.choices normalizes the weights)
        x = random.choices(x_range, weights=weights_by_y[y])[0]
        x_samples.append(x)
        y_samples.append(y)
        # Sample the next y given x and the old y (using rho)
        p_zero = f_y_given_x_y(0, x, y, rho)
        y = random.choices([0, 1], weights=[p_zero, 1 - p_zero])[0]
    return x_samples, y_samples

def calculate_variances(estimates_original, estimates_rb):
    """Variances, across replications, of the original and RB estimators."""
    def variance(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / len(values)
    return variance(estimates_original), variance(estimates_rb)

# 3. Experiment with different rho values; each chain is replicated so that the
#    variance of the estimators themselves can be estimated
num_samples = 1000
num_replications = 100
y_initial = 0
rho_values = [0.1, 0.3, 0.5, 0.7, 0.9]  # Probability that Y keeps its state

results = []
for rho in rho_values:
    estimates_original, estimates_rb = [], []
    for _ in range(num_replications):
        x_samples, y_samples = dual_chain_sampler(num_samples, y_initial, rho)
        # delta_0: plain average of the simulated x values
        estimates_original.append(sum(x_samples) / num_samples)
        # delta_rb: average of the conditional means E[X | y^(t)]
        estimates_rb.append(sum(cond_mean_by_y[y] for y in y_samples) / num_samples)
    results.append((rho, *calculate_variances(estimates_original, estimates_rb)))

# 4. Analyze and visualize the variance ratio (a finite-sample proxy for the ARE)
are_values = []
for rho, var_original, var_rb in results:
    are = var_original / var_rb if var_rb > 0 else float('inf')
    are_values.append(are)
    print(f"Rho: {rho:.1f}, Var(original): {var_original:.6f}, Var(RB): {var_rb:.6f}, ARE: {are:.2f}")

# (Optional) Plot the estimated ratio vs. rho
plt.plot(rho_values, are_values, marker='o')
plt.xlabel("Rho")
plt.ylabel("Estimated var(delta_0) / var(delta_rb)")
plt.title("Estimated ARE vs. Rho")
plt.grid(True)
plt.show()

## The EM-Gibbs Connection

As mentioned earlier, the EM algorithm can be seen as a precursor of the two-stage Gibbs sampler in missing data models (Section 5.3.1), in that it similarly exploits the conditional distribution of the missing variables. The connection goes further, as seen below.

Recall from Section 5.3.2 that, if $X \sim g(x|\theta)$ is the observed data, and we augment the data with $z$, where $(X, Z) \sim f(x, z|\theta)$, then we have the complete-data and incomplete-data likelihoods
$$
L^c(\theta|x, z) = f(x, z|\theta) \quad \text{and} \quad L(\theta|x) = g(x|\theta),
$$
with the missing data density
$$
k(z|x, \theta) = \frac{L^c(\theta|x, z)}{L(\theta|x)}.
$$

If we can normalize the complete-data likelihood in $\theta$ (this is the only condition for the equivalence mentioned above), that is, if $\int L^c(\theta|x, z) d\theta < \infty$, then define
$$
L^*(\theta|x, z) = \frac{L^c(\theta|x, z)}{\int L^c(\theta|x, z) d\theta}
$$
and create the two-stage Gibbs sampler:

$$
\begin{aligned}
1. \quad z|\theta &\sim k(z|x, \theta) \\
2. \quad \theta|z &\sim L^*(\theta|x, z). \quad (9.16)
\end{aligned}
$$

Note the direct connection to an EM algorithm based on $L^c$ and $k$. The "E" step in the EM algorithm calculates the expected value of the log-likelihood over $z$, often by calculating $\mathbb{E}(Z|x, \theta)$ and substituting in the log-likelihood. In the Gibbs sampler this step is replaced with generating a random variable from the density $k$. The "M" step of the EM algorithm then takes as the current value of $\theta$ the maximum of the expected complete-data log-likelihood. In the Gibbs sampler this step is replaced by generating a value of $\theta$ from $L^*$, the normalized complete-data likelihood.

**Example 9.21.** (Censored data Gibbs) For the censored data example considered in Example 5.14, the distribution of the missing data is
$$
Z_i \sim \frac{\phi(z - \theta)}{1 - \Phi(a - \theta)}
$$
and the distribution of $\theta|x, z$ is
$$
L(\theta|x, z) \propto \prod_{i=1}^m \frac{1}{\sqrt{2\pi}} e^{-\frac{(x_i - \theta)^2}{2}} \prod_{i=m+1}^n \frac{1}{\sqrt{2\pi}} e^{-\frac{(z_i - \theta)^2}{2}}
$$
which corresponds to a
$$
\mathcal{N} \left( \frac{m\bar{x} + (n-m)\bar{z}}{n}, \frac{1}{n} \right)
$$
distribution, and so we immediately have that $L^*$ exists and that we can run a Gibbs sampler (Problem 9.14).
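The two conditional draws of this example can be sketched in a few lines. This is a minimal illustration, not the text's own implementation: the observed values `x_obs`, the censoring point `a`, the total sample size `n`, and all function names are hypothetical, and the censored values are assumed to lie above `a`. The truncated normal is simulated by inverting the normal CDF with `statistics.NormalDist`.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def sample_censored(theta, a, rng):
    """Draw Z ~ N(theta, 1) restricted to Z > a by inverting the CDF."""
    lower = STD_NORMAL.cdf(a - theta)
    u = lower + (1.0 - lower) * rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)    # keep inv_cdf inside (0, 1)
    return max(a, theta + STD_NORMAL.inv_cdf(u))

def censored_gibbs(x_obs, n, a, num_iter, theta0=0.0, seed=0):
    """Alternate z | theta and theta | z for the censored normal model."""
    rng = random.Random(seed)
    m = len(x_obs)
    x_sum = sum(x_obs)
    theta = theta0
    thetas = []
    for _ in range(num_iter):
        # Step 1: impute the n - m censored observations given theta
        z = [sample_censored(theta, a, rng) for _ in range(n - m)]
        # Step 2: theta | x, z ~ N((m * xbar + (n - m) * zbar) / n, 1 / n)
        mean = (x_sum + sum(z)) / n
        theta = rng.gauss(mean, (1.0 / n) ** 0.5)
        thetas.append(theta)
    return thetas

# Hypothetical data: 7 observed values, 3 more censored at a = 4.5
x_obs = [3.2, 3.9, 2.7, 4.1, 3.5, 3.0, 3.8]
thetas = censored_gibbs(x_obs, n=10, a=4.5, num_iter=2000)
burn_in = 200
theta_mean = sum(thetas[burn_in:]) / len(thetas[burn_in:])
print(f"Estimated mean of theta after burn-in: {theta_mean:.3f}")
```

The burn-in length is an arbitrary choice for the sketch; in practice it would be set after inspecting the trace of `thetas`.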

The validity of the "EM/Gibbs" sampler follows in a straightforward manner from its construction. The transition kernel of the Markov chain is
$$
\mathcal{K}(\theta, \theta'|x) = \int_{\mathcal{Z}} k(z|x, \theta) L^*(\theta'|x, z) dz
$$
and it can be shown (Problem 9.15) that the invariant distribution of the chain is the incomplete data likelihood, that is,
$$
\pi(\theta|x) = L(\theta|x).
$$
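A sketch of why this holds, assuming the order of integration can be exchanged: using $k(z|x, \theta) L(\theta|x) = L^c(\theta|x, z)$, then $L^*(\theta'|x, z) \int L^c(\theta|x, z) d\theta = L^c(\theta'|x, z)$, and finally $\int_{\mathcal{Z}} f(x, z|\theta') dz = g(x|\theta')$, we get

$$
\begin{aligned}
\int \mathcal{K}(\theta, \theta'|x) L(\theta|x) \, d\theta
&= \int_{\mathcal{Z}} L^*(\theta'|x, z) \left[ \int k(z|x, \theta) L(\theta|x) \, d\theta \right] dz \\
&= \int_{\mathcal{Z}} L^*(\theta'|x, z) \left[ \int L^c(\theta|x, z) \, d\theta \right] dz \\
&= \int_{\mathcal{Z}} L^c(\theta'|x, z) \, dz = L(\theta'|x),
\end{aligned}
$$

so $L(\cdot|x)$ is left unchanged by one step of the kernel.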

Since $L(\theta|x, z)$ is integrable in $\theta$, so is $L(\theta|x)$, and hence the invariant distribution is a proper density. So the Markov chain is positive, and convergence follows from Theorem 9.6.

**Example 9.22.** (Cellular phone Gibbs) As an illustration of the EM-Gibbs connection, we revisit Example 5.18, but now we use the Gibbs sampler to get our solution. From the complete data likelihood (5.18) and the missing data distribution (5.19) we have (Problem 9.16)

$$
\begin{aligned}
p \,|\, W_1, W_2, \ldots, W_5, \textstyle\sum X_i &\sim \mathcal{D}(W_1 + 1, W_2 + 1, \ldots, W_5 + \textstyle\sum X_i + 1) \\
\textstyle\sum X_i &\sim \text{Neg} \left( \sum_{i=1}^m n_i + m, \; 1 - p_5 \right). \quad (9.17)
\end{aligned}
$$

The results of the Gibbs iterations are shown in Figure 9.3. The point estimates agree with those of the EM algorithm (Example 5.18), $\hat{p} = (0.258, 0.313, 0.140, 0.118, 0.170)$, with the exception of $\hat{p}_5$, which is larger than the MLE. This may reflect the fact that the Gibbs estimate is a mean (and gets pulled a bit into the tail), while the MLE is a mode. Measures of error can be obtained from either the iterations or the histograms.

Based on the same functions $L(\theta|y, z)$ and $k(z|\theta, y)$ the EM algorithm will get the ML estimator from $L(\theta|y)$, whereas the Gibbs sampler will get us

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# --- Conditional distributions from (9.17) ---

def d_distribution(W, X_sum):
    """Draw p ~ Dirichlet(W_1 + 1, ..., W_4 + 1, W_5 + X_sum + 1)."""
    alpha = [w + 1 for w in W]
    alpha[-1] += X_sum  # The missing counts are added to the last category
    return np.random.dirichlet(alpha)

def neg_binomial(n, p):
    """Draw from Neg(n, p): number of failures before n successes of probability p."""
    return np.random.negative_binomial(n, p)

# --- EM-Gibbs Sampler ---

def em_gibbs_sampler(num_samples, W, initial_X_sum, n_list, m):
    """Alternate the two draws of (9.17); W is observed data and stays fixed."""
    p_samples = []
    X_sum_samples = []
    X_sum = initial_X_sum

    for _ in range(num_samples):
        # 1. p | W, X_sum: Dirichlet step
        p = d_distribution(W, X_sum)
        # 2. X_sum | p: negative binomial step, with p[-1] playing the role of p_5
        X_sum = neg_binomial(sum(n_list) + m, 1 - p[-1])

        p_samples.append(p)
        X_sum_samples.append(X_sum)

    return p_samples, X_sum_samples

# --- Example Usage (hypothetical placeholder data; replace with your actual data) ---

num_samples = 5000
# Made-up counts W_1..W_5, chosen large relative to sum(n_list) + m so that the
# simulated X_sum stays bounded; they are NOT the data of Example 5.18
W = [120, 150, 70, 60, 40]
initial_X_sum = 5  # Initial value for X_sum
n_list = [10, 12, 8, 15, 9]  # Replace with your n_i values
m = 5  # Replace with your m value

p_samples, X_sum_samples = em_gibbs_sampler(num_samples, W, initial_X_sum, n_list, m)

# --- Analyze and visualize the results ---

# Example: Plot the distribution of one of the components of p
component_index = 0  # Choose which component of p to plot
p_values = [p[component_index] for p in p_samples]
plt.hist(p_values, bins=30)
plt.title(f"Distribution of p[{component_index}]")
plt.xlabel(f"p[{component_index}] Value")
plt.ylabel("Frequency")
plt.show()

# Example: Print point estimates (means over the iterations)
p_means = np.mean(p_samples, axis=0)
X_sum_mean = np.mean(X_sum_samples)

print("Point Estimates:")
print("p:", p_means)
print("X_sum:", X_sum_mean)

In [None]:
import random
import matplotlib.pyplot as plt

# --- Conditional distributions from (9.17), standard library only (no NumPy) ---

def dirichlet_distribution(W, X_sum):
    """Draw p ~ Dirichlet(W_1 + 1, ..., W_4 + 1, W_5 + X_sum + 1).

    Uses the representation of a Dirichlet vector as normalized gamma draws.
    """
    alpha = [w + 1 for w in W]
    alpha[-1] += X_sum  # The missing counts are added to the last category
    gamma_samples = [random.gammavariate(a, 1) for a in alpha]
    total_gamma = sum(gamma_samples)
    return [g / total_gamma for g in gamma_samples]

def neg_binomial(n, p):
    """Draw from Neg(n, p): number of failures before n successes of probability p."""
    failures = 0
    successes = 0
    while successes < n:
        if random.random() < p:
            successes += 1
        else:
            failures += 1
    return failures

# --- EM-Gibbs Sampler ---

def em_gibbs_sampler(num_samples, W, initial_X_sum, n_list, m):
    """Alternate the two draws of (9.17); W is observed data and stays fixed."""
    p_samples = []
    X_sum_samples = []
    X_sum = initial_X_sum

    for _ in range(num_samples):
        # 1. p | W, X_sum: Dirichlet step
        p = dirichlet_distribution(W, X_sum)
        # 2. X_sum | p: negative binomial step, with p[-1] playing the role of p_5
        X_sum = neg_binomial(sum(n_list) + m, 1 - p[-1])

        p_samples.append(p)
        X_sum_samples.append(X_sum)

    return p_samples, X_sum_samples

# --- Example Usage (hypothetical placeholder data; replace with your actual data) ---

num_samples = 5000
# Made-up counts W_1..W_5, chosen large relative to sum(n_list) + m so that the
# simulated X_sum stays bounded; they are NOT the data of Example 5.18
W = [120, 150, 70, 60, 40]
initial_X_sum = 5  # Initial value for X_sum
n_list = [10, 12, 8, 15, 9]  # Replace with your n_i values
m = 5  # Replace with your m value

p_samples, X_sum_samples = em_gibbs_sampler(num_samples, W, initial_X_sum, n_list, m)

# --- Analyze and visualize the results ---

# Example: Plot the distribution of one of the components of p
component_index = 0  # Choose which component of p to plot
p_values = [p[component_index] for p in p_samples]
plt.hist(p_values, bins=30)
plt.title(f"Distribution of p[{component_index}]")
plt.xlabel(f"p[{component_index}] Value")
plt.ylabel("Frequency")
plt.show()

# Example: Print point estimates (means over the iterations)
p_means = [sum(p[i] for p in p_samples) / len(p_samples) for i in range(len(W))]
X_sum_mean = sum(X_sum_samples) / len(X_sum_samples)

print("Point Estimates:")
print("p:", p_means)
print("X_sum:", X_sum_mean)