# Dual Linear Regression

## 1. Introduction

Dual linear regression reparameterizes the standard linear regression model in terms of training examples rather than input dimensions. This is particularly efficient when:
- Input dimensionality $D$ is high
- Number of training examples $I$ is small ($I < D$)

## 2. Mathematical Formulation

### Standard Prediction Model
The prediction model remains linear:

$Pr(w_i|x_i) = Norm_{w_i}[\phi^T x_i, \sigma^2]$

### Dual Parameterization
The key difference is representing the slope parameters $\phi$ as:

$\phi = X\psi$

where:
- $\psi$ is an $I \times 1$ vector of weights
- $X = [x_1, x_2, \ldots, x_I]$ is the $D \times I$ data matrix whose columns are the training examples
- $\phi$ is therefore a weighted sum of the training examples

### Complete Model
Substituting the dual parameterization:

$Pr(w_i|x_i, \theta) = Norm_{w_i}[\psi^T X^T x_i, \sigma^2]$

For all data points:

$P r(w|X, \theta) = Norm_w[X^T X\psi, \sigma^2I]$
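
A minimal NumPy sketch of the substitution above, assuming the column convention of the equations ($X$ is $D \times I$, one training example per column). The sizes, random data, and variable names are illustrative assumptions. It checks that predicting through $\phi = X\psi$ gives the same means as working with the inner products $X^TX$ directly.

```python
import numpy as np

rng = np.random.default_rng(0)
D, I = 6, 3                      # illustrative sizes with I < D
X = rng.standard_normal((D, I))  # columns are training examples
psi = rng.standard_normal(I)     # dual weights, one per training example

# Primal slope vector as a weighted sum of the training examples
phi = X @ psi

# Predicted means for all training points, computed two ways
mean_primal = X.T @ phi          # phi^T x_i for every i
mean_dual = (X.T @ X) @ psi      # X^T X psi, uses inner products only

print(np.allclose(mean_primal, mean_dual))
```

The dual route never forms a $D$-dimensional quantity, which is the property the later kernel extension relies on.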

## 3. Maximum Likelihood Solution

The log-likelihood function is:

$\log Pr(w|X,\psi,\sigma^2) = -\frac{I}{2}\log[2\pi] - \frac{I}{2}\log[\sigma^2] - \frac{(w-X^TX\psi)^T(w-X^TX\psi)}{2\sigma^2}$

Maximizing with respect to $\psi$ and $\sigma^2$ gives:

$\hat{\psi} = (X^TX)^{-1}w$

$\hat{\sigma}^2 = \frac{(w-X^TX\hat{\psi})^T(w-X^TX\hat{\psi})}{I}$
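
As a short derivation sketch of the estimate for $\psi$ (it assumes $X^TX$ is invertible, which requires linearly independent training examples and hence $I \le D$), set the gradient of the log-likelihood with respect to $\psi$ to zero:

$$
\frac{\partial \log Pr}{\partial \psi} = \frac{1}{\sigma^2} X^TX\,(w - X^TX\psi) = 0 \;\Rightarrow\; X^TX\psi = w \;\Rightarrow\; \hat{\psi} = (X^TX)^{-1}w
$$

Under the same assumption the fitted values $X^TX\hat{\psi}$ equal $w$ exactly, so the variance estimate evaluates to zero on the training data.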

## 4. Implementation

```python
import numpy as np
from scipy import stats

class DualLinearRegression:
    def __init__(self):
        self.psi = None
        self.sigma2 = None
        self.X = None

    def fit(self, X, w):
        """
        Fit dual linear regression model
        
        Parameters:
        -----------
        X: array, shape (n_samples, n_features)
            Training data, one example per row (the transpose of the
            D x I matrix used in the equations)
        w: array, shape (n_samples,)
            Target values
        """
        # Keep the training examples: predictions need inner products with them
        self.X = X

        # Gram matrix of inner products between training examples (I x I).
        # With one example per row this is X @ X.T, i.e. X^T X in the equations.
        K = X @ X.T

        # Compute dual parameters
        self.psi = np.linalg.solve(K, w)

        # Compute predictions on the training data
        w_pred = K @ self.psi

        # Estimate noise variance
        residuals = w - w_pred
        self.sigma2 = residuals @ residuals / len(w)

        return self

    def predict(self, X_new, return_std=False):
        """
        Make predictions for new data
        
        Parameters:
        -----------
        X_new: array, shape (n_samples, n_features)
            New data points
        return_std: bool
            Whether to return standard deviation
            
        Returns:
        --------
        mean: array, shape (n_samples,)
            Predicted mean
        std: array, shape (n_samples,) (optional)
            Predicted standard deviation
        """
        # Inner products between new and training examples, weighted by psi
        mean = X_new @ self.X.T @ self.psi

        if return_std:
            std = np.sqrt(self.sigma2) * np.ones_like(mean)
            return mean, std

        return mean
```

## 5. Example Usage

```python
# Generate synthetic data
np.random.seed(42)
n_samples, n_features = 100, 500  # High-dimensional case
X = np.random.randn(n_samples, n_features)
true_phi = np.zeros(n_features)
true_phi[:5] = np.random.randn(5)  # Only first 5 features matter
w = X @ true_phi + np.random.randn(n_samples) * 0.1

# Fit dual regression
model = DualLinearRegression()
model.fit(X, w)

# Make predictions
X_test = np.random.randn(10, n_features)
mean_pred, std_pred = model.predict(X_test, return_std=True)
```

## 6. Relationship to Standard Linear Regression

Provided the matrix inverses involved exist, the dual formulation gives the same solution as standard linear regression:

$\hat{\phi} = X\hat{\psi} = X(X^TX)^{-1}w = (XX^T)^{-1}XX^TX(X^TX)^{-1}w = (XX^T)^{-1}Xw$

Key differences:
1. Computational efficiency when $I < D$
2. Natural transition to kernel methods
3. Basis for more advanced models like Relevance Vector Machines
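
The equivalence can be checked numerically. The sketch below assumes the column convention of the equations and $I < D$, where $XX^T$ is singular, so the primal side is computed with `np.linalg.lstsq`, which returns the minimum-norm least-squares solution. That choice of comparison, the sizes, and the random data are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(1)
D, I = 20, 5
X = rng.standard_normal((D, I))          # columns are training examples
w = rng.standard_normal(I)

# Dual route: solve the I x I system, then map back to phi = X psi
psi_hat = np.linalg.solve(X.T @ X, w)
phi_dual = X @ psi_hat

# Primal route: minimum-norm least-squares solution of X^T phi = w
phi_primal, *_ = np.linalg.lstsq(X.T, w, rcond=None)

print(np.allclose(phi_dual, phi_primal))
```

The dual route solves a $5 \times 5$ system instead of working with the $20 \times 20$ matrix $XX^T$, which is where the efficiency gain for $I < D$ comes from.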

## 7. Advantages and Limitations

### Advantages:
1. Efficient for high-dimensional data ($D \gg I$)
2. Natural framework for kernel methods
3. Same solution as standard linear regression

### Limitations:
1. Less efficient when $I > D$
2. Can only represent gradients in span of training data
3. Requires storing all training examples

![image.png](attachment:image.png)

Fig.12 Dual variables. Two-dimensional training data $\{x_i\}_{i=1}^I$ and associated world state $\{w_i\}_{i=1}^I$ (indicated by marker color). The linear regression parameter $\phi$ determines the direction in this 2D space in which $w$ changes most quickly. We can alternately represent the gradient direction as a weighted sum of data examples. Here we show the case $\phi = \psi_1 x_1 + \psi_2 x_2$. In practical problems the data dimensionality $D$ is greater than the number of examples $I$, so we take a weighted sum $\phi = X\psi$ of all of the data points. This is the dual parameterization.

# Bayesian Dual Linear Regression

## 1. Bayesian Formulation

### Prior Distribution
We define a normal prior over the dual parameters $\psi$:

$P r(\psi) = Norm_\psi[0, \sigma_p^2I]$

where $\sigma_p^2$ represents our prior uncertainty.

### Posterior Distribution
Using Bayes' rule:

$Pr(\psi|X, w, \sigma^2) = \frac{Pr(w|X, \psi, \sigma^2)Pr(\psi)}{Pr(w|X, \sigma^2)}$

The posterior has a closed-form Gaussian distribution:

$P r(\psi|X, w, \sigma^2) = Norm_\psi[\frac{1}{\sigma^2}A^{-1}X^TXw, A^{-1}]$

where:

$A = \frac{1}{\sigma^2}X^TXX^TX + \frac{1}{\sigma_p^2}I$

### Predictive Distribution
The predictive distribution for a new input $x^*$ is:

$P r(w^*|x^*, X, w) = \int P r(w^*|x^*, \psi)P r(\psi|X, w) d\psi$

$= Norm_{w^*}[\frac{1}{\sigma^2}x^{*T}XA^{-1}X^TXw, x^{*T}XA^{-1}X^Tx^* + \sigma^2]$

## 2. Implementation

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar

class BayesianDualRegression:
    def __init__(self, sigma_p2=100.0):
        """
        Parameters:
        -----------
        sigma_p2: float
            Prior variance
        """
        self.sigma_p2 = sigma_p2
        self.sigma2 = None
        self.A_inv = None
        self.X = None
        self.w = None
        self.K = None

    def fit(self, X, w):
        """
        Fit Bayesian dual regression model
        
        Parameters:
        -----------
        X: array, shape (n_samples, n_features)
            Training data, one example per row (the transpose of the
            D x I matrix used in the equations)
        w: array, shape (n_samples,)
            Target values
        """
        self.X = X
        self.w = w

        # Gram matrix of inner products between training examples (I x I).
        # With one example per row this is X @ X.T, i.e. X^T X in the equations.
        self.K = X @ X.T

        # Estimate noise variance by maximizing the marginal likelihood
        self.sigma2 = self._estimate_sigma2()

        # Compute A matrix
        A = (1/self.sigma2) * self.K @ self.K + (1/self.sigma_p2) * np.eye(len(w))

        # Store inverse for predictions
        self.A_inv = np.linalg.inv(A)

        return self

    def _estimate_sigma2(self):
        """Estimate noise variance by maximizing the marginal likelihood"""
        KK = self.K @ self.K

        def neg_log_ml(log_sigma2):
            # Marginal likelihood: Norm_w[0, sigma_p^2 X^T X X^T X + sigma^2 I]
            cov = self.sigma_p2 * KK + np.exp(log_sigma2) * np.eye(len(self.w))
            return -stats.multivariate_normal.logpdf(
                self.w,
                mean=np.zeros_like(self.w),
                cov=cov
            )

        # 1D search over log(sigma^2); the bounds are a pragmatic choice
        result = minimize_scalar(neg_log_ml, bounds=(-10.0, 10.0), method='bounded')
        return float(np.exp(result.x))

    def predict(self, X_new, return_std=False):
        """
        Make predictions for new data
        
        Parameters:
        -----------
        X_new: array, shape (n_samples, n_features)
            New data points
        return_std: bool
            Whether to return standard deviation
            
        Returns:
        --------
        mean: array, shape (n_samples,)
            Predicted mean
        std: array, shape (n_samples,) (optional)
            Predicted standard deviation
        """
        # Inner products between new and training examples, shape (n_new, I)
        k_star = X_new @ self.X.T

        # Compute mean prediction
        mean = k_star @ self.A_inv @ self.K @ self.w / self.sigma2

        if return_std:
            # Predictive variance x*^T X A^{-1} X^T x* + sigma^2 for each new point
            var = np.sum((k_star @ self.A_inv) * k_star, axis=1) + self.sigma2
            std = np.sqrt(var)
            return mean, std

        return mean
```

## 3. Example Usage

```python
# Generate synthetic data
np.random.seed(42)
n_samples, n_features = 100, 500
X = np.random.randn(n_samples, n_features)
true_psi = np.zeros(n_samples)
true_psi[:5] = np.random.randn(5)
# X holds one example per row, so the I x I matrix of inner products is X @ X.T
w = X @ X.T @ true_psi + np.random.randn(n_samples) * 0.1

# Fit Bayesian model
model = BayesianDualRegression(sigma_p2=100.0)
model.fit(X, w)

# Make predictions with uncertainty
X_test = np.random.randn(10, n_features)
mean_pred, std_pred = model.predict(X_test, return_std=True)
```

## 4. Extension to Nonlinear Case

To extend to nonlinear regression, replace:
1. Training data $X$ with transformed data $Z = [z_1, z_2, ..., z_I]$
2. Test data $x^*$ with transformed test data $z^*$

The resulting expressions depend only on inner products:
- $Z^TZ$
- $Z^Tz^*$

This makes the model amenable to kernelization.
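
A minimal sketch of this substitution, assuming a radial basis function kernel with a hypothetical length-scale parameter `lam` and one example per row. The function names, default hyperparameters, and toy data are illustrative assumptions. Every inner product in the Bayesian predictive distribution is replaced by a kernel evaluation, so the transformed data $Z$ is never formed.

```python
import numpy as np

def rbf_kernel(A, B, lam=1.0):
    """k[a, b] = exp(-0.5 * ||a - b||^2 / lam^2) for every pair of rows of A and B."""
    sq_dist = (np.sum(A**2, axis=1)[:, None]
               + np.sum(B**2, axis=1)[None, :]
               - 2.0 * A @ B.T)
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0) / lam**2)

def kernel_predict(X, w, X_new, sigma2=0.01, sigma_p2=1.0, lam=1.0):
    """Predictive mean and variance with inner products replaced by kernel values."""
    K = rbf_kernel(X, X, lam)            # stands in for Z^T Z
    k_star = rbf_kernel(X_new, X, lam)   # stands in for Z^T z*, one row per test point
    A = K @ K / sigma2 + np.eye(len(w)) / sigma_p2
    A_inv = np.linalg.inv(A)
    mean = k_star @ A_inv @ K @ w / sigma2
    var = np.sum((k_star @ A_inv) * k_star, axis=1) + sigma2
    return mean, var

rng = np.random.default_rng(2)
X = np.linspace(-3, 3, 25)[:, None]      # 1D inputs, one example per row
w = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(25)
mean, var = kernel_predict(X, w, np.array([[0.0], [1.5]]))
print(mean, var)
```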

## 5. Marginal Likelihood

The marginal likelihood is:

$P r(w|X, \sigma^2) = Norm_w[0, \sigma_p^2X^TXX^TX + \sigma^2I]$

This can be used to:
1. Estimate $\sigma^2$
2. Compare different models
3. Select hyperparameters
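
As an illustration of the first use, the sketch below evaluates the log marginal likelihood on a grid of candidate noise variances and keeps the best one. It assumes the column convention of the equations; the grid, the sizes, and the helper name `log_marginal_likelihood` are choices made for this example only.

```python
import numpy as np
from scipy import stats

def log_marginal_likelihood(K, w, sigma2, sigma_p2):
    """log Norm_w[0, sigma_p^2 K K + sigma^2 I], with K the I x I matrix X^T X."""
    cov = sigma_p2 * K @ K + sigma2 * np.eye(len(w))
    return stats.multivariate_normal.logpdf(w, mean=np.zeros(len(w)), cov=cov)

rng = np.random.default_rng(3)
D, I = 30, 10
X = rng.standard_normal((D, I))              # columns are training examples
K = X.T @ X
true_sigma2, sigma_p2 = 0.5, 1.0
# Draw targets from the model itself so the data match its assumptions
psi = rng.standard_normal(I) * np.sqrt(sigma_p2)
w = K @ psi + rng.standard_normal(I) * np.sqrt(true_sigma2)

# Coarse grid search over candidate noise variances
candidates = np.logspace(-3, 2, 30)
scores = [log_marginal_likelihood(K, w, s2, sigma_p2) for s2 in candidates]
best_sigma2 = candidates[int(np.argmax(scores))]
print(best_sigma2)
```

The same scoring function could be reused to compare values of $\sigma_p^2$ or kernel parameters; a continuous 1D optimizer is an alternative to the grid.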


## Relevance Vector Regression

Having developed the dual approach to linear regression, we are now in a position to develop a model that depends only sparsely on the training data. This model is known as Relevance Vector Regression (RVR).
![image.png](attachment:image.png)

**(Fig.13: Relevance vector regression. A prior encouraging sparseness is applied to the dual parameters. This means that the final regressor depends only on a subset of the data points (indicated by the six larger points). The resulting regression function is considerably faster to evaluate and tends to be simpler: this means it is less likely to overfit to random statistical fluctuations in the training data and generalizes better to new data.)**

### Sparseness Prior

To achieve sparsity, we impose a penalty for every non-zero weighted training example.  This is done by replacing the normal prior over the dual parameters $\psi$ with a product of one-dimensional t-distributions:

$$
\prod_{i=1}^I Pr(\psi_i) = \prod_{i=1}^I \text{Student-t}(\psi_i | 0, 1, \nu)  \qquad (8.53)
$$

This model is known as relevance vector regression. This situation is analogous to the sparse linear regression model (Section 8.6), except that we are now working with dual variables.

### Marginal Likelihood Approximation

As in the sparse model, it is not possible to marginalize over the variables $\psi$ with the t-distributed prior.  We approximate the t-distributions by maximizing with respect to their hidden variables rather than marginalizing (Equation 8.35). By analogy with Section 8.6, the marginal likelihood becomes:

$$
Pr(w|X, \sigma^2) \approx \max_{H} \left[ \text{Norm}_w[0, X^T X H^{-1} X^T X + \sigma^2 I] \prod_{i=1}^I \text{Gamma}(h_i | \nu/2, \nu/2) \right] \qquad (8.54)
$$

where the matrix $H$ contains the hidden variables $\{h_i\}_{i=1}^I$ associated with the t-distribution on its diagonal and zeros elsewhere. This expression is similar to Equation 8.52, except that instead of every data point having the same prior variance $\sigma_p^2$, they now have individual variances determined by the hidden variables in the diagonal matrix $H$.

### Optimization Procedure

In relevance vector regression, we alternately:

**(i)** Optimize the marginal likelihood with respect to the hidden variables:

$$
h_i^{\text{new}} = \frac{1 - h_i \Sigma_{ii} + \nu}{\mu_i^2 + \nu} \qquad (8.55)
$$

**(ii)** Optimize the marginal likelihood with respect to the variance parameter $\sigma^2$:

$$
(\sigma^2)^{\text{new}} = \frac{1}{I - \sum_i (1 - h_i \Sigma_{ii})} (w - X^T X \mu)^T (w - X^T X \mu) \qquad (8.56)
$$

Between each step, we update the mean $\mu$ and variance $\Sigma$ of the posterior distribution:

$$
\begin{aligned}
\mu &= \frac{1}{\sigma^2} A^{-1} X^T X w \\
\Sigma &= A^{-1}
\end{aligned} \qquad (8.57)
$$

where $A$ is defined as:

$$
A = \frac{1}{\sigma^2} X^T X X^T X + H \qquad (8.58)
$$

### Sparsity

At the end of training, data examples where the hidden variable $h_i$ is large (e.g., > 1000) are discarded, as the corresponding coefficients $\psi_i$ will be very small and contribute almost nothing to the solution.
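
The alternating procedure can be sketched as follows. This is a rough illustration under several assumptions, not a reference implementation: it takes the posterior to be $\mu = \frac{1}{\sigma^2}A^{-1}X^TXw$ with $A = \frac{1}{\sigma^2}X^TXX^TX + H$ (the Bayesian dual posterior with $H$ in place of the prior precision), and the initial values, the fixed iteration count, the numerical floors, and the name `fit_rvr` are all choices made here.

```python
import numpy as np

def fit_rvr(K, w, nu=1e-3, n_iter=50, prune_threshold=1000.0):
    """Alternate updates (8.55)-(8.58). K is the I x I matrix X^T X or a kernel matrix."""
    I = len(w)
    KK = K @ K
    h = np.ones(I)                      # diagonal of H (initial value is an assumption)
    sigma2 = max(np.var(w), 1e-6)       # initial noise variance (heuristic assumption)

    def posterior(h, sigma2):
        A = KK / sigma2 + np.diag(h)    # Equation 8.58
        Sigma = np.linalg.inv(A)        # Equation 8.57
        mu = Sigma @ K @ w / sigma2
        return mu, Sigma

    for _ in range(n_iter):
        # (i) update the hidden variables, Equation 8.55
        mu, Sigma = posterior(h, sigma2)
        gamma = np.clip(1.0 - h * np.diag(Sigma), 0.0, 1.0)
        h = (gamma + nu) / (mu**2 + nu)
        # (ii) update the noise variance, Equation 8.56
        mu, Sigma = posterior(h, sigma2)
        gamma = np.clip(1.0 - h * np.diag(Sigma), 0.0, 1.0)
        resid = w - K @ mu
        sigma2 = max(resid @ resid / max(I - gamma.sum(), 1e-6), 1e-6)

    mu, Sigma = posterior(h, sigma2)
    keep = h < prune_threshold          # examples with large h_i are discarded
    return mu, sigma2, h, keep

rng = np.random.default_rng(4)
D, I = 40, 30
X = rng.standard_normal((D, I))         # columns are training examples
K = X.T @ X
psi_true = np.zeros(I)
psi_true[:3] = [1.0, -2.0, 1.5]         # only three examples carry weight
w = K @ psi_true + 0.1 * rng.standard_normal(I)

mu, sigma2, h, keep = fit_rvr(K, w)
print(int(keep.sum()), "of", I, "examples kept")
```

How many examples survive depends on $\nu$, the pruning threshold, and the scale of the data, so these would normally be tuned rather than fixed.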

### Nonlinear Version

Since this algorithm depends only on inner products, a nonlinear version can be generated by replacing the inner products with a kernel function $k[x_i, x_j]$. If the kernel contains parameters, these may also be manipulated to improve the log marginal likelihood during fitting. Figure 8.13 shows an example fit using the RBF kernel. The final solution depends only on six data points but still captures the important aspects of the data.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Original parameterization (x1, x2) - highly correlated
mean = [0, 0]
covariance = [[1, 0.99], [0.99, 1]]  # Strong correlation

# Reparameterization (y1, y2) - less correlated
eigenvalues, eigenvectors = np.linalg.eig(covariance)

def transform_to_y(x):
    return np.dot(eigenvectors.T, x)

def transform_to_x(y):
    return np.dot(eigenvectors, y)

# Sampling (random-walk Metropolis-Hastings in y-space, simplified for illustration)
n_samples = 1000
x_samples = []
y = np.zeros(2)  # Initial values in the y-space
cov_inv = np.linalg.inv(covariance)  # Precision matrix, computed once
step = np.sqrt(eigenvalues)  # Per-axis proposal scale along the decorrelated axes

for _ in range(n_samples):
    # Symmetric random-walk proposal, so no proposal-density correction is needed
    y_proposal = y + step * np.random.normal(0, 1, size=2)
    x = transform_to_x(y)
    x_proposal = transform_to_x(y_proposal)

    # The eigenvector matrix is orthogonal, so the change of variables has unit Jacobian
    log_prob_current = -0.5 * x @ cov_inv @ x
    log_prob_proposal = -0.5 * x_proposal @ cov_inv @ x_proposal
    alpha = np.exp(min(0.0, log_prob_proposal - log_prob_current))

    if np.random.uniform() < alpha:
        y = y_proposal
    x_samples.append(transform_to_x(y))

x_samples = np.array(x_samples)

# Plotting
plt.figure(figsize=(8, 6))
plt.scatter(x_samples[:, 0], x_samples[:, 1], s=10)
plt.title("Samples in Original (x1, x2) Space")
plt.xlabel("x1")
plt.ylabel("x2")
plt.show()

In [None]:
import random
import math
import matplotlib.pyplot as plt  # For plotting (no Seaborn without NumPy)

def gibbs_metropolization(y_initial, n_iterations, g, g_prop, *args):
    """Metropolized Gibbs sampling for discrete variables - No NumPy.

    g(y, i, *args) returns the conditional probability of y[i] given the other
    components; g_prop(y, i, *args) draws a proposal from that conditional
    restricted to values different from y[i].
    """
    n_vars = len(y_initial)
    y = list(y_initial)
    samples = [list(y)]

    for t in range(n_iterations):
        for i in range(n_vars):
            z_i = g_prop(y, i, *args)

            # Metropolized Gibbs acceptance: min(1, (1 - p_current) / (1 - p_proposal))
            p_current = g(y, i, *args)
            p_proposal = g(z_i, i, *args)
            if p_proposal >= 1:
                alpha = 1.0  # the proposed value carries all of the conditional mass
            else:
                alpha = min(1.0, (1 - p_current) / (1 - p_proposal))

            u = random.random()
            if u < alpha:
                y[i] = z_i[i]
        samples.append(list(y))

    return samples


# Example distributions (replace with your own)
def g(y, i, *args):
    """Example conditional distribution of y[i] (replace).

    The example joint is a product of independent binomials, so the
    conditional of y[i] is simply its own binomial PMF.
    """
    a, b, x_max = args
    return binom_pmf(y[i], x_max[i], a[i] * 0.5 / (a[i] * 0.5 + b[i]))

def binom_pmf(k, n, p):
    """Binomial PMF."""
    if 0 <= k <= n:
        coeff = math.factorial(n) // (math.factorial(k) * math.factorial(n - k))
        return coeff * (p**k) * ((1 - p) ** (n - k))
    else:
        return 0

def g_prop(y, i, *args):
    """Example proposal (replace): draw from the conditional, excluding the current value."""
    a, b, x_max = args
    p = a[i] * 0.5 / (a[i] * 0.5 + b[i])
    z = list(y)
    while True:
        # Binomial draw as a sum of Bernoulli trials (standard library only)
        z[i] = sum(random.random() < p for _ in range(x_max[i]))
        if z[i] != y[i]:
            break
    return z


# Example usage:
a = [0.06, 0.14, 0.11, 0.09]
b = [0.17, 0.24, 0.19, 0.20]
x_max = [9, 15, 12, 7]
y_initial = [0, 0, 0, 0]
n_iterations = 10000

samples = gibbs_metropolization(y_initial, n_iterations, g, g_prop, a, b, x_max)



# --- Plotting with Matplotlib (no Seaborn) ---

# 1. Trace Plots
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
fig.suptitle("Trace Plots of Samples", fontsize=16)

for i in range(4):
    row = i // 2
    col = i % 2
    y_i_values = [sample[i] for sample in samples]  # Extract y_i values
    axes[row, col].plot(range(n_iterations + 1), y_i_values)
    axes[row, col].set_title(f"Trace of y{i+1}")
    axes[row, col].set_xlabel("Iteration")
    axes[row, col].set_ylabel(f"y{i+1} Value")

plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()


# 2. Distribution Plots (Histograms)
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
fig.suptitle("Distribution of Samples", fontsize=16)

for i in range(4):
    row = i // 2
    col = i % 2
    y_i_values = [sample[i] for sample in samples]
    axes[row, col].hist(y_i_values, density=True)  # density=True for probabilities
    axes[row, col].set_title(f"Distribution of y{i+1}")
    axes[row, col].set_xlabel(f"y{i+1} Value")
    axes[row, col].set_ylabel("Density")

plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

# ... (Autocorrelation and Running Mean plots can be added using Matplotlib)


## Regression to Multivariate Data

Throughout this chapter, we have discussed predicting a scalar value $w_i$ from multivariate data $x_i$. In real-world situations such as the pose regression problem, the world states $w_i$ are multivariate. It is trivial to extend the models in this chapter: we simply construct a separate regressor for each dimension. The exception to this rule is the relevance vector machine (RVM): here we might want to ensure that the sparse structure is common for each of these models so the efficiency gains are retained. To this end, we modify the model so that a single set of hidden variables is shared across the model for each world state dimension.
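
For the maximum likelihood dual model, building a separate regressor per dimension amounts to solving the same linear system with several right-hand sides. The sketch below assumes the column convention of the equations; the sizes and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
D, I, n_dims = 12, 6, 3
X = rng.standard_normal((D, I))          # columns are training examples
W = rng.standard_normal((I, n_dims))     # one column of targets per world-state dimension

gram = X.T @ X
# Solving with a matrix right-hand side fits every dimension in one call
Psi = np.linalg.solve(gram, W)

# Same result as fitting each dimension with its own regressor
Psi_loop = np.column_stack([np.linalg.solve(gram, W[:, k]) for k in range(n_dims)])
print(np.allclose(Psi, Psi_loop))
```

This shortcut does not carry over directly to the relevance vector machine, where the hidden variables would otherwise differ per dimension unless they are shared as described above.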

## Applications

Regression methods are used less frequently in vision than classification, but nonetheless, there are many useful applications. The majority of these involve estimating the position or pose of objects since the unknowns in such problems are naturally treated as continuous.

### Human Body Pose Estimation
![image.png](attachment:image.png)

Agarwal & Triggs (2006) developed a system based on the relevance vector machine to predict body pose $w$ from silhouette data $x$. To encode the silhouette, they computed a 60-dimensional shape context feature at each of 400-500 points on the boundary of the object. To reduce the data dimensionality, they computed the similarity of each shape context feature to each of 100 different prototypes. Finally, they formed a 100-dimensional histogram containing the aggregated 100-dimensional similarities for all of the boundary points. This histogram was used as the data vector $x$.
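
The encoding step can be sketched roughly as below. The Gaussian similarity, the per-point normalization, the `scale` parameter, and the random stand-in data are all assumptions made for illustration; the original system's exact similarity measure is not specified here.

```python
import numpy as np

def silhouette_descriptor(features, prototypes, scale=1.0):
    """Sum soft similarities of boundary features to prototypes into one histogram."""
    # Squared distances between every feature (rows) and every prototype (rows)
    d2 = ((features[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    sim = np.exp(-0.5 * d2 / scale**2)           # Gaussian similarity (an assumption)
    sim /= sim.sum(axis=1, keepdims=True)        # each boundary point casts one unit of vote
    return sim.sum(axis=0)                       # aggregate over all boundary points

rng = np.random.default_rng(6)
features = rng.standard_normal((450, 60))        # 450 boundary points, 60-D shape contexts
prototypes = rng.standard_normal((100, 60))      # 100 prototypes
x = silhouette_descriptor(features, prototypes, scale=3.0)
print(x.shape)
```

The result is a fixed-length 100-dimensional vector regardless of how many boundary points the silhouette has, which is what makes it usable as the data vector $x$.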

The body pose was encoded by the 3 joint angles of each of the 18 major body joints and the overall azimuth (compass heading) of the body. The resulting 55-dimensional vector was used as the world state $w$.

A relevance vector machine was trained using 2636 data vectors $x_i$ extracted from silhouettes that were rendered using the commercial program **POSER** from known motion capture data $w_i$. Using a radial basis function kernel, the relevance vector machine based its solution on just **6%** of these training examples. The body pose angles of test data could be predicted to within an average of **6°** error:

$$ \text{Mean test error} = 6.0^\circ $$

They also demonstrated that the system worked reasonably well on silhouettes from real images.

Silhouette information is by its nature ambiguous: it is very hard to tell which leg is in front of the other based on a single silhouette. Agarwal & Triggs (2006) partially circumvented this problem by tracking the body pose $w_i$ through a video sequence. Essentially, the ambiguity at a given frame is resolved by encouraging the estimated pose in adjacent frames in the sequence to be similar: information from frames where the pose vector is well-defined is propagated through the sequence to resolve ambiguities in other parts.

However, the ambiguity of silhouette data is an argument against using this type of model: the regression models in this chapter are designed to give a unimodal normal output. To handle single frames of data effectively, we should use a regression method that produces a **multi-modal prediction** that can describe the ambiguity.

![image-2.png](attachment:image-2.png)
Fig.15 Tracking using displacement experts. The goal of the system is to predict a displacement vector indicating the motion of the object based on the pixel data at its last known position. a) The system is trained by perturbing the bounding box around the object to simulate the motion of the object. b) The system successfully tracks a face, even in the presence c) of partial occlusions. d) If the system is trained using gradient vectors rather than raw pixel values, it is also quite robust to changes in illumination. Adapted from Williams et al. (2005). ©2005 IEEE.

### Displacement Experts

Regression models are also used to form displacement experts in tracking applications. The goal is to take a region of the image $x$ and return a set of numbers $w$ that indicate the change in position of an object relative to the window. The world state $w$ might simply contain the horizontal and vertical translation vectors or might contain parameters of a more complex 2D transformation (chapter 15). For simplicity, we will describe the former situation.

Training data are extracted as follows. A bounding box around the object of interest (car, face, etc.) is identified in a number of frames. For each of these frames, the bounding box is perturbed by a pre-determined set of translation vectors, to simulate the object moving in the opposite direction (figure 8.15a). In this way, we associate a translation vector $w_i$ with each perturbation. The data from the perturbed bounding box are extracted, resized to a standard shape, and histogram equalized (section 13.1.2) to induce a degree of invariance to illumination changes. The resulting values are then concatenated to form the data vector $x_i$.
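
The perturbation step can be sketched as follows. This illustration omits the resizing and histogram equalization described above, and the helper name `make_displacement_training_set`, the `(top, left, height, width)` box layout, and the sign convention for the targets are assumptions of the example.

```python
import numpy as np

def make_displacement_training_set(frame, box, offsets):
    """Crop perturbed boxes from one frame and pair each crop with a translation vector."""
    top, left, height, width = box
    data, targets = [], []
    for dy, dx in offsets:
        t, l = top + dy, left + dx
        if t < 0 or l < 0 or t + height > frame.shape[0] or l + width > frame.shape[1]:
            continue  # skip boxes that fall outside the frame
        patch = frame[t:t + height, l:l + width].astype(float)
        data.append(patch.ravel())
        # Shifting the box by (dy, dx) mimics the object moving by (-dy, -dx)
        targets.append((-dy, -dx))
    return np.array(data), np.array(targets, dtype=float)

rng = np.random.default_rng(7)
frame = rng.integers(0, 256, size=(120, 160))    # stand-in grayscale frame
box = (40, 60, 32, 32)                           # top, left, height, width
offsets = [(dy, dx) for dy in range(-4, 5, 2) for dx in range(-4, 5, 2)]
X_train, W_train = make_displacement_training_set(frame, box, offsets)
print(X_train.shape, W_train.shape)
```

Repeating this over several annotated frames and stacking the results would give the pairs $(x_i, w_i)$ used to train the regressors.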

Williams et al. (2005) describe a system of this kind in which the elements of $w$ were learned by a set of independent relevance vector machines. They initialize the position of the object using a standard object detector (see chapter 9). In the subsequent frame, they compute a prediction for the displacement vector $w$ using the relevance vector machines on the data $x$ from the original position. This prediction is combined in a Kalman filter-like system (chapter 19) that imposes prior knowledge about the continuity of the motion to create a robust method for tracking known objects in scenes. Fig.15b-d show a series of tracking results from this system.





In [None]:
import numpy as np
from sklearn_rvm import EMRVR  # Relevance Vector Machine for Regression
from sklearn.model_selection import train_test_split

# Generate synthetic training data (for demonstration)
def generate_training_data(num_samples=1000):
    X = np.random.rand(num_samples, 64, 64) * 255  # Simulated 8-bit image patches (64x64)
    w = np.random.rand(num_samples, 2) * 10 - 5  # Random translation vectors (-5 to 5)
    return X, w

# Preprocess data: Flatten images & normalize
def preprocess_data(X):
    return X.reshape(X.shape[0], -1) / 255.0  # Flatten and scale pixel values to [0, 1]

# Train one RVM per displacement component, following the "set of independent
# relevance vector machines" described above (EMRVR is assumed to be single-output)
def train_displacement_expert(X, w):
    models = []
    for k in range(w.shape[1]):
        model = EMRVR(kernel='rbf', gamma=0.1)  # Radial Basis Function kernel
        model.fit(X, w[:, k])  # Train on input patches and one translation component
        models.append(model)
    return models

# Predict displacement of an object in a new frame
def predict_displacement(models, new_patch):
    new_patch = preprocess_data(new_patch.reshape(1, 64, 64))
    return np.array([model.predict(new_patch)[0] for model in models])

# Main execution
if __name__ == "__main__":
    X, w = generate_training_data()
    X = preprocess_data(X)

    # Split into train/test sets
    X_train, X_test, w_train, w_test = train_test_split(X, w, test_size=0.2, random_state=42)

    # Train one model per displacement component
    models = train_displacement_expert(X_train, w_train)

    # Simulate tracking with a new image patch
    test_patch = np.random.rand(64, 64) * 255  # Simulated new image patch
    predicted_displacement = predict_displacement(models, test_patch)

    print(f"Predicted Displacement Vector: {predicted_displacement}")
