In [None]:
'''
 * Copyright (c) 2008 Radhamadhab Dalai
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 * THE SOFTWARE.
'''

#### Inference for Mixtures

Although they may seem to apply only to some very particular sets of random phenomena, _mixtures of distributions_ (9.3) are of wide use in practical modeling.

However, as already noticed in Examples 1.2 and 3.7, they can be challenging from an inferential point of view (that is, when estimating the parameters $p_j$ and $\xi_j$). Everitt (1984), Titterington et al. (1985), MacLachlan and Basford (1988), West (1992), Titterington (1996), Robert (1996a), and Marin et al. (2004) all provide different perspectives on mixtures of distributions, discuss their relevance for modeling purposes, and give illustrations of their use in various setups of (9.5).

We assume, without a considerable loss of generality, that $f(\cdot|\xi)$ belongs to an exponential family
$$
f(x|\xi) = h(x) \exp\{\xi' T(x) - A(\xi)\},
$$
and we consider the associated conjugate prior on $\xi$ (see Robert 2001, Section 3.3)
$$
\pi(\xi|\lambda, \alpha) \propto \exp\{\xi' \lambda - \alpha A(\xi)\}, \quad \lambda > 0, \alpha > 0.
$$

For the mixture (9.5), it is therefore possible to associate with each component $f(\cdot|\xi_j)$ $(j = 1, \ldots, k)$ a conjugate prior $\pi(\xi_j|\lambda_j, \alpha_j)$. We also select for $(p_1, \ldots, p_k)$ the standard Dirichlet conjugate prior, that is,
$$
(p_1, ..., p_k) \sim \mathcal{D}(\gamma_1, ..., \gamma_k).
$$
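As a small illustration of the Dirichlet prior above, the following sketch draws a weight vector by normalizing independent Gamma variates, which is the standard construction. The function name `sample_dirichlet` and the choice $\gamma_1 = \gamma_2 = \gamma_3 = 1$ are assumptions made for this example only.

```python
import random

def sample_dirichlet(gammas, rng=random):
    """Draw (p_1, ..., p_k) from a Dirichlet(gamma_1, ..., gamma_k).

    Independent Gamma(gamma_j, 1) draws are normalized by their sum.
    """
    draws = [rng.gammavariate(g, 1.0) for g in gammas]
    total = sum(draws)
    return [d / total for d in draws]

random.seed(0)
# Hypothetical hyperparameters: a flat Dirichlet on three components.
weights = sample_dirichlet([1.0, 1.0, 1.0])
print(weights, sum(weights))
```

The draws are positive and sum to one up to floating-point error, so they can be used directly as mixture weights.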

Given a sample $(x_1, \ldots, x_n)$ from (9.5), and conjugate priors on $\xi$ and $(p_1, \ldots, p_k)$ (see Robert 2001, Section 3.3), the posterior distribution associated with this model is formally explicit (see Problem 9.25). However, it is virtually useless for large, or even moderate, values of $n$. In fact, the posterior distribution,
$$
\pi(p, \xi|x_1, ..., x_n) \propto \prod_{i=1}^n \left( \sum_{j=1}^k p_j f(x_i|\xi_j) \right) \prod_{j=1}^k \pi(\xi_j| \cdot) \pi(p|\cdot),
$$
is better expressed as a sum of $k^n$ terms which correspond to the different allocations of the observations $x_i$ to the components of (9.5). Although each term is conjugate, the number of terms involved in the posterior distribution makes the computation of the normalizing constant and of posterior expectations totally infeasible for large sample sizes (see Diebolt and Robert 1990a). (In a simulation experiment, Casella et al. 1999 actually noticed that very few of these terms carry a significant posterior weight, but there is no manageable approach to determine which terms are relevant and which are not.) The complexity of this model is such that there are virtually no other solutions than using the Gibbs sampler (see, for instance, Smith and Makov 1978 or Bernardo and Girón 1986, 1988, for pre-Gibbs approximations).
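The expansion into $k^n$ allocation terms can be checked numerically on a tiny sample. The sketch below assumes unit-variance normal components and made-up data, weights, and means; it compares the product-of-sums form of the mixture likelihood with the explicit sum over all allocations.

```python
import math
from itertools import product

def normal_pdf(x, mean, sd=1.0):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def likelihood_product(xs, weights, means):
    """Mixture likelihood as a product over observations of sums over components."""
    out = 1.0
    for x in xs:
        out *= sum(p * normal_pdf(x, m) for p, m in zip(weights, means))
    return out

def likelihood_allocation_sum(xs, weights, means):
    """Same likelihood, expanded as a sum over all k**n allocation vectors."""
    k = len(weights)
    total, n_terms = 0.0, 0
    for z in product(range(k), repeat=len(xs)):
        term = 1.0
        for x, j in zip(xs, z):
            term *= weights[j] * normal_pdf(x, means[j])
        total += term
        n_terms += 1
    return total, n_terms

# Hypothetical toy inputs: n = 4 observations, k = 2 components.
xs = [-1.2, 0.3, 2.1, 1.7]
weights = [0.3, 0.7]
means = [0.0, 2.0]

direct = likelihood_product(xs, weights, means)
expanded, n_terms = likelihood_allocation_sum(xs, weights, means)
print(n_terms, direct, expanded)
```

With $n = 4$ and $k = 2$ the enumeration has only 16 terms, but the same loop would need $2^{100}$ terms for $n = 100$, which is why the explicit form is unusable in practice.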

The solution proposed by Diebolt and Robert (1990c,b, 1994), Lavine and West (1992), Verdinelli and Wasserman (1992), and Escobar and West (1995) is to take advantage of the missing data structure inherent to (9.5), as in Example 9.2.

Good performance of the Gibbs sampler is guaranteed by the above setup since the Duality Principle of Section 9.2.3 applies. One can also deduce geometric convergence and a Central Limit Theorem. Moreover, Rao-Blackwellization is justified (see Problem 9.25).
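To make the missing data structure concrete, here is a minimal sketch of a data-augmentation Gibbs sampler for a normal mixture. It is not the algorithm of the cited papers: it assumes unit-variance normal components, a common $\mathcal{N}(\texttt{prior\_mean}, \texttt{prior\_var})$ prior on each mean, and a symmetric Dirichlet prior with parameter `gamma`. All names and default values are illustrative.

```python
import math
import random

def gibbs_mixture(xs, k, n_iter, gamma=1.0, prior_mean=0.0, prior_var=10.0, seed=0):
    rng = random.Random(seed)
    means = [rng.choice(xs) for _ in range(k)]  # start from observed values
    weights = [1.0 / k] * k
    draws = []
    for _ in range(n_iter):
        # Step 1: allocate each observation to a component (the missing data).
        counts = [0] * k
        sums = [0.0] * k
        for x in xs:
            logp = [math.log(w) - 0.5 * (x - m) ** 2 for w, m in zip(weights, means)]
            top = max(logp)  # subtract the max to avoid underflow
            probs = [math.exp(v - top) for v in logp]
            j = rng.choices(range(k), weights=probs)[0]
            counts[j] += 1
            sums[j] += x
        # Step 2: conjugate normal update of each component mean.
        for j in range(k):
            precision = 1.0 / prior_var + counts[j]
            post_mean = (prior_mean / prior_var + sums[j]) / precision
            means[j] = rng.gauss(post_mean, math.sqrt(1.0 / precision))
        # Step 3: conjugate Dirichlet update of the weights.
        g = [rng.gammavariate(gamma + c, 1.0) for c in counts]
        total = sum(g)
        weights = [v / total for v in g]
        draws.append((list(weights), list(means)))
    return draws

# Hypothetical test data: 0.3 N(-2, 1) + 0.7 N(3, 1).
data_rng = random.Random(42)
xs = [data_rng.gauss(-2.0, 1.0) if data_rng.random() < 0.3 else data_rng.gauss(3.0, 1.0)
      for _ in range(100)]
draws = gibbs_mixture(xs, k=2, n_iter=200)
print(draws[-1])
```

Conditional on the allocations, every update is a standard conjugate draw, which is the point of the completion. Component labels are not identified, so the order of the two means in the output is arbitrary.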

The practical implementation of [4.34] might, however, face serious convergence difficulties, in particular because of the phenomenon of the "absorbing component" (Diebolt and Robert 1990, Mengersen and Robert 1996, Robert 1996). When only a small number of observations are allocated to a given component $j_0$, the following probabilities are quite small:

## Trapping States in Slice Sampler: Continuation of Example .2

This section examines the **trapping states** phenomenon, particularly in the context of Bayesian hierarchical models and Gibbs sampling. Trapping states occur when the sampler requires an enormous number of iterations to escape a specific state, leading to computational inefficiency.

---

## The Problem of Trapping States

In the slice sampler algorithm (Algorithm A.34), while the Markov chain $ (X^{(t)})_{t \geq 0} $ is irreducible, trapping states can arise due to improper Bayesian approaches or poor tuning of hyperparameters. The key issues include:
1. **Likelihood Dominance**: Allocating new observations to component $ j_0 $ becomes highly improbable.
2. **Low Escape Probability**: Reallocating observations already assigned to $ j_0 $ to other components becomes highly improbable.

These issues can become computational bottlenecks, particularly when $ \gamma_j $ values are small.

---

## Example .3: Bayesian Mixture Model with Small Component Weights

### Setting

Consider a Bayesian mixture model where:
- $ \lambda_j \ll 1 $, for $ j = 1, \dots, k $, and a single observation is allocated to $ j_0 $.  
- Algorithm A.38 is used to sample from the posterior.

From the hierarchical structure of the model, we approximate:

$$
[Z_{i0} | T_{i0}, \nu] \sim \mathcal{N}(\xi_{i0}, \tau^2), \quad T_{i0} \sim 2Q_{\beta_j}(\nu, \tau^2),
$$

where:
- $ \beta_j \ll \lambda_j $,
- $ \nu \approx \xi_{i0} $,
- $ \tau \approx 0 $.

### Allocation Probabilities

Using the above setup:
1. For $ Z_{i0} = j_0 $, the probability is:

$$
P(Z_{i0} = j_0) \propto \pi_{j_0} e^{-(\xi_{i0} - \nu)^2 / 2 \tau^2}.
$$

2. For $ Z_{i0} \neq j_0 $, the probability is:

$$
P(Z_{i0} \neq j_0) \propto \pi_j e^{-(\xi_{i0} - \nu)^2 / 2 \tau^2}.
$$

Given the very rapid decay of $ e^{-t^2 / 2 \tau^2} $ when $ \tau $ is close to 0, the probability of escaping state $ j_0 $ is extremely small.
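The speed of this decay is easy to see numerically. The sketch below evaluates $e^{-t^2 / 2\tau^2}$ at a fixed, arbitrarily chosen distance $t = 0.5$ for decreasing values of $\tau$; the function name and the grid of values are assumptions made for illustration.

```python
import math

def gaussian_kernel(t, tau):
    """Evaluate exp(-t^2 / (2 tau^2))."""
    return math.exp(-t ** 2 / (2 * tau ** 2))

for tau in (1.0, 0.1, 0.01):
    print(tau, gaussian_kernel(0.5, tau))
```

For $\tau = 0.01$ the exponent is $-1250$, and the value underflows to exactly zero in double precision, so a move with this weight would never be selected by a sampler.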

---

## Trapping and Solutions

Trapping states are particularly problematic because they "lock" the sampler into specific configurations due to computational rounding errors or improper Bayesian approaches.

To mitigate this issue:
- **Wider Moves**: Introduce Metropolis-Hastings steps or proposals with wider moves.
- **Improved Estimation**: Use advanced sampling methods, such as blocked Gibbs sampling or reparameterization, to reduce the trapping phenomenon.

For further details, refer to **Problem 9.19** and **Lemma 9.20** in the referenced material.




# ARCH Models with Gibbs Sampler

The **Auto Regressive Conditionally Heteroscedastic (ARCH)** model is a key tool in econometrics and finance for analyzing time-varying volatility. This section explores its formulation and estimation using the Gibbs sampler.

---

## Model Definition

For $ t = 2, \dots, T $, a **Gaussian ARCH** model is defined as:

$$
\begin{aligned}
Z_t &= \left( \alpha + \beta Z_{t-1}^2 \right)^{1/2} \epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, 1), \\
X_t &= a Z_t + \xi_t, \quad \xi_t \sim \mathcal{N}(0, \sigma^2).
\end{aligned}
$$

- Parameters:  
  - $ \theta = (a, \beta, \sigma) $ represents the parameter vector of interest.
  - $ Z_t $ are latent variables, and only $ X_t $ values are observed.  

- Notation:  
  $ X^T = (X_1, \dots, X_T) $, representing observed data.  
  $ Z^T = (Z_1, \dots, Z_T) $, representing unobserved data.

---

## Special Case and Likelihood

For simplicity, consider the special case:

$$
\begin{aligned}
Z_t &= (1 + \beta Z_{t-1}^2)^{1/2} \epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, 1), \\
X_t &= a Z_t + \xi_t, \quad \xi_t \sim \mathcal{N}(0, \sigma^2).
\end{aligned}
$$

### Likelihood Function
The likelihood is based on marginalizing out the latent variables $ Z^T $:

$$
L(\theta | X^T) = \mathbb{E}\left[ \frac{f(X^T, Z^T | \theta)}{f(Z^T | \theta)} \bigg| X^T \right].
$$

Approximating the expectation with $ m $ simulated copies $ Z_1, \dots, Z_m $ of $ Z^T $:

$$
h_m(\theta) = \frac{1}{m} \sum_{i=1}^m \frac{f(X^T, Z_i | \theta)}{f(Z_i | \theta)}, \quad Z_i \sim f(Z^T | X^T, \theta).
$$

The integral over the latent path $ Z^T $ is thus replaced by a Monte Carlo average over simulated latent paths.

---

## Simulation-Based Approximation

The distribution $ f(Z^T | X^T, \theta) $ is defined as:

$$
f(Z^T | X^T, \theta) \propto \exp\left(-\frac{1}{2} \sum_{t=1}^T \frac{|X_t - aZ_t|^2}{\sigma^2}\right) \prod_{t=2}^T \frac{1}{(1 + \beta Z_{t-1}^2)^{1/2}} e^{-Z_t^2 / 2(1+\beta Z_{t-1}^2)}.
$$
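Since this density is only known up to a constant and is not a standard distribution in $ Z^T $, one possible way to simulate from it is a single-site random-walk Metropolis sweep. This is a substituted technique, not necessarily the scheme used by Billio et al. (1998). The sketch assumes a scalar $ a $ and $ \beta \geq 0 $; the function names and the proposal scale `step` are hypothetical.

```python
import math
import random

def log_target(Z, X, a, beta, sigma2):
    """Log of the unnormalized density f(Z^T | X^T, theta) displayed above."""
    obs = -0.5 * sum((x - a * z) ** 2 for x, z in zip(X, Z)) / sigma2
    trans = 0.0
    for t in range(1, len(Z)):
        var_t = 1.0 + beta * Z[t - 1] ** 2
        trans += -0.5 * math.log(var_t) - Z[t] ** 2 / (2.0 * var_t)
    return obs + trans

def metropolis_sweep(Z, X, a, beta, sigma2, step=0.5, rng=random):
    """Update each Z_t in turn with a Gaussian random-walk proposal."""
    Z = list(Z)
    current = log_target(Z, X, a, beta, sigma2)
    accepted = 0
    for t in range(len(Z)):
        old = Z[t]
        Z[t] = old + rng.gauss(0.0, step)
        proposed = log_target(Z, X, a, beta, sigma2)
        # Accept with probability min(1, target ratio).
        if rng.random() < math.exp(min(0.0, proposed - current)):
            current = proposed
            accepted += 1
        else:
            Z[t] = old
    return Z, accepted

random.seed(1)
X_demo = [0.1, -0.3, 0.5]  # made-up observations
Z_new, n_accepted = metropolis_sweep([0.0, 0.0, 0.0], X_demo, a=-0.2, beta=0.8, sigma2=0.2)
print(Z_new, n_accepted)
```

For simplicity the full log density is recomputed at every site, which costs $O(T^2)$ per sweep; only the terms involving $ Z_t $ and $ Z_{t+1} $ actually change, so a production version would update those alone.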

Using this distribution, $ h_m(\theta) $ can be approximated as:

$$
h_m(\theta) = \frac{1}{m} \sum_{i=1}^m \frac{f(X^T | Z_i, \theta)}{f(Z_i | \theta)}.
$$

---

## Example: Estimation Results

Billio et al. (1998) simulated $ T = 100 $ observations with:
- $ a = (-0.2, 0.6) $,
- $ \beta = 0.8 $,
- $ \sigma^2 = 0.2 $.

Table 9.3 summarizes the approximate maximum likelihood estimation. The estimates are sensitive to the prior choice and the number of samples $ m $. Using $ m = 50,000 $, the estimated parameters demonstrate a flat likelihood surface, indicating potential challenges in parameter estimation.




In [1]:
import random
import math

# Generate simulated data for the ARCH(1) model
def generate_data(T, a, beta, sigma2):
    Z = [random.gauss(0, 1)]  # Initialize Z_t with Z_1
    X = [a * Z[0] + random.gauss(0, math.sqrt(sigma2))]  # Generate X_1
    for t in range(1, T):
        Z_t = math.sqrt(1 + beta * Z[-1]**2) * random.gauss(0, 1)
        X_t = a * Z_t + random.gauss(0, math.sqrt(sigma2))
        Z.append(Z_t)
        X.append(X_t)
    return X, Z

# Gibbs-style iterative scheme for ARCH(1).
# Note: as written, Step 1 does not condition on X, and Steps 2-4 are
# deterministic plug-in updates rather than draws from full conditionals.
def gibbs_sampler(X, num_iterations, a_init, beta_init, sigma2_init):
    T = len(X)
    a = a_init
    beta = beta_init
    sigma2 = sigma2_init
    Z = [random.gauss(0, 1) for _ in range(T)]  # Initialize latent Z_t

    samples = []

    for _ in range(num_iterations):
        # Step 1: Refresh each Z_t from N(a * Z_{t-1}, 1 + beta * Z_{t-1}^2),
        # with Z_1 drawn from N(0, 1); the observations X_t are not used here
        for t in range(T):
            mean = a * Z[t - 1] if t > 0 else 0  # Mean used for the draw
            variance = 1 + beta * (Z[t - 1]**2 if t > 0 else 0)
            Z[t] = random.gauss(mean, math.sqrt(variance))

        # Step 2: Set a to the least-squares slope of X_t on Z_t
        numerator = sum(Z[t] * X[t] for t in range(T))
        denominator = sum(Z[t]**2 for t in range(T))
        a = numerator / denominator

        # Step 3: Set beta to a ratio of sums of squared latent values
        numerator_beta = sum(Z[t]**2 for t in range(1, T))
        denominator_beta = sum(Z[t - 1]**2 for t in range(1, T))
        beta = numerator_beta / (1 + denominator_beta)

        # Step 4: Set sigma2 to the mean squared residual of X_t - a * Z_t
        residuals = [X[t] - a * Z[t] for t in range(T)]
        sigma2 = sum(r**2 for r in residuals) / T

        # Store the current parameter values
        samples.append((a, beta, sigma2))

    return samples

# True parameter values and data generation
T = 100
true_a = -0.2
true_beta = 0.8
true_sigma2 = 0.2
X, Z_true = generate_data(T, true_a, true_beta, true_sigma2)

# Initialize parameters for Gibbs Sampling
a_init = 0.0
beta_init = 0.5
sigma2_init = 0.5
num_iterations = 1000

# Run Gibbs Sampler
samples = gibbs_sampler(X, num_iterations, a_init, beta_init, sigma2_init)

# Analyze and print results
a_samples = [s[0] for s in samples]
beta_samples = [s[1] for s in samples]
sigma2_samples = [s[2] for s in samples]

print(f"Posterior Mean Estimates:\n")
print(f"a: {sum(a_samples) / len(a_samples):.4f}")
print(f"beta: {sum(beta_samples) / len(beta_samples):.4f}")
print(f"sigma^2: {sum(sigma2_samples) / len(sigma2_samples):.4f}")


Posterior Mean Estimates:

a: -0.0001
beta: 1.0149
sigma^2: 0.2880
