# Objective

In this second notebook, we'll dive deeper into the concept of conjugate priors in Bayesian linear regression (BLR).

We'll see how conjugate distributions let us derive posterior distributions analytically, which simplifies Bayesian inference.

After providing the theory, we will see how this inference can be done using `skpro`'s `BayesianConjugateLinearRegressor` class.

# Imports and Helper Functions

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
from IPython.display import Math, display
from utils import style_data

from skpro.regression.bayesian import BayesianConjugateLinearRegressor

In [None]:
%load_ext autoreload
%autoreload 2

# Theory

## Bayesian linear regression: recap

As mentioned in the first notebook, **Bayesian linear regression (BLR)** is a probabilistic approach to linear regression where prior beliefs about the model parameters are combined with observed data to compute posterior distributions. Unlike traditional linear regression, which only provides point estimates for parameters and predictions, BLR offers a complete distribution for them, providing an intuitive way to reason about uncertainty in both parameters and predictions.

As a reminder, BLR builds on Bayes' theorem:

$$
\begin{align*}
\color{green}{P(\mathbf{w} \mid \mathbf{t})} &= \frac{\color{blue}{P(\mathbf{t} \mid \mathbf{w})} \times \color{orange}{P(\mathbf{w})}}{\color{purple}{P(\mathbf{t})}} \\
\color{green}{\text{posterior}} &= \frac{\color{blue}{\text{likelihood}} \times \color{orange}{\text{prior}}}{\color{purple}{\text{marginal likelihood}}}
\end{align*}
\tag{1}
$$

Where:

- $\color{orange}{P(\mathbf{w})}$ is the $\color{orange}{\textbf{prior}}$ for parameters $\mathbf{w}$, reflecting our beliefs about $\mathbf{w}$ before observing the data. 


- $\color{green}{P(\mathbf{w} \mid \mathbf{t})}$ represents the $\color{green}{\textbf{posterior}}$ for $\mathbf{w}$ given the observed data $(\mathbf{t}, \mathbf{X})$. It combines the prior and the likelihood and quantifies our belief about $\mathbf{w}$ *after* observing the data. 


- $\color{purple}{P(\mathbf{t})}$ is the $\color{purple}{\textbf{marginal likelihood}}$, the normalising constant which ensures that the posterior integrates to one.


- $\color{blue}{P(\mathbf{t} \mid \mathbf{w})}$  is the $\color{blue}{\textbf{likelihood}}$ of the target data $\mathbf{t}$ given the input data $\mathbf{X}$, parameters $\mathbf{w}$, and noise precision $\beta$. It measures how well a particular set of parameters $\mathbf{w}$ explains the observed target values $\mathbf{t}$. Assuming each data point $(\mathbf{x}_n, t_n)$ is drawn independently, the likelihood is:

$$
\begin{aligned}
\color{blue}{P(\mathbf{t} \mid \mathbf{w})} &= \prod_{n=1}^N \mathcal{N}(t_n \mid \mathbf{w}^T \mathbf{x}_n, \beta^{-1}) \\
&\propto \exp \left( -\frac{\beta}{2} \sum_{n=1}^N (t_n - \mathbf{w}^T \mathbf{x}_n)^2 \right) \\
&\propto \exp \left( -\frac{\beta}{2} \|\mathbf{t} - \mathbf{X} \mathbf{w}\|^2 \right) \tag{2}
\end{aligned}
$$

A notational reminder: As the data matrix $\mathbf{X}$ and the known precision parameter $\beta$ always appear in the set of conditioning variables, we have dropped the explicit $\mathbf{X}$ and $\beta$ from expressions such as the likelihood $\color{blue}{P(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta)}$ to keep the notation uncluttered.
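
To make Eq. (2) concrete, here is a minimal NumPy sketch that evaluates the log-likelihood for two candidate weight vectors. The helper `gaussian_log_likelihood` and the toy data are assumptions made for illustration only; they are not part of `skpro`.

```python
import numpy as np


def gaussian_log_likelihood(X, t, w, beta):
    """Log of the Gaussian likelihood in Eq. (2), normalising constant included."""
    residuals = t - X @ w
    n = t.shape[0]
    return 0.5 * n * np.log(beta / (2 * np.pi)) - 0.5 * beta * residuals @ residuals


# Toy data (assumed for illustration): intercept column plus two features
rng = np.random.default_rng(0)
X_demo = np.column_stack([np.ones(50), rng.uniform(0, 1, size=(50, 2))])
w_true = np.array([1.0, 2.0, 3.0])
beta_demo = 4.0  # noise precision, i.e. noise standard deviation 0.5
t_demo = X_demo @ w_true + rng.normal(0, beta_demo**-0.5, size=50)

ll_true = gaussian_log_likelihood(X_demo, t_demo, w_true, beta_demo)
ll_off = gaussian_log_likelihood(X_demo, t_demo, np.array([1.0, 4.0, 5.0]), beta_demo)
print(f"log-likelihood at the true weights: {ll_true:.1f}")
print(f"log-likelihood at [1, 4, 5]:        {ll_off:.1f}")
```

Weights that explain the targets well leave small residuals and therefore receive a higher log-likelihood than weights that are far off.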


## Conjugacy and Conjugate Prior

A **conjugate prior** is a prior distribution that, when combined with a specific likelihood, ensures that the posterior distribution belongs to the same family as the prior. We'll soon see that this property of "staying in the same family" greatly simplifies Bayesian computations!


*So, what would be the conjugate prior in our case?*

To answer this, let’s revisit the likelihood. As mentioned earlier, the likelihood $ \color{blue}{P(\mathbf{t} \mid \mathbf{w})} $ follows a Gaussian distribution.

To ensure conjugacy and simplify computation, we choose a multivariate Gaussian distribution as the prior for $ \mathbf{w} $. This choice leverages a key property of Gaussian distributions: *the product of two Gaussian densities is proportional to another Gaussian density*.

Let’s now define the prior to set the stage for the analytical derivation that follows. 

The multivariate Gaussian prior is given by:

$$
\begin{aligned}
\color{orange}{P(\mathbf{w})} &= \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0) \\
&\propto \exp \left( -\frac{1}{2} (\mathbf{w} - \mathbf{m}_0)^\top \mathbf{S}_0^{-1} (\mathbf{w} - \mathbf{m}_0) \right) \tag{3}
\end{aligned}
$$

Here:
- $ \mathbf{m}_0 $ is the **mean vector** of the multivariate Gaussian prior of the regression coefficients $\mathbf{w}$. <br>
$ \mathbf{m}_0 $ has a shape of $(D, 1)$, where $D$ is the number of features.
- $ \mathbf{S}_0 $ is the **covariance matrix** of the prior, with shape $ (D, D) $.

Both $ \mathbf{m}_0 $ and $ \mathbf{S}_0 $ should be selected based on prior knowledge or assumptions about the regression coefficients.

> **Note:** Do not confuse $\mathbf{S}_0$, the prior covariance matrix of the parameter $\mathbf{w}$, with $\beta$, which represents the precision (inverse variance) of the noise term in regression. In this framework, for simplicity, we assume $\beta$ is known. If it is unknown, an alternative conjugate framework, the Multivariate Normal-Wishart distribution, must be used.


## Conjugate Posterior



According to the Bayes formula, the $\color{green}{\textbf{posterior}}$ is proportional to the product of the $\color{blue}{\textbf{likelihood}}$ and $\color{orange}{\textbf{prior}}$:
<br>

$$
\color{green}{P(\mathbf{w} | \mathbf{t})} \propto \color{blue}{\exp \left( -\frac{\beta}{2} \|\mathbf{t} - \mathbf{X} \mathbf{w}\|^2 \right)} \color{orange}{\exp \left( -\frac{1}{2} (\mathbf{w} - \mathbf{m}_0)^T \mathbf{S}_0^{-1} (\mathbf{w} - \mathbf{m}_0) \right)} \tag{4}
$$

After expanding the terms in the exponents and completing the square, we obtain a posterior that's also a multivariate Gaussian:

$$
\color{green}{P(\mathbf{w} | \mathbf{t})} = \mathcal{N}(\mathbf{w} | \mathbf{m}_N, \mathbf{S}_N) \tag{5}
$$

In this formula,
$\mathbf{S}_N$ is the *covariance* of our posterior Gaussian distribution. <br> Its inverse (i.e. the  *precision* of the posterior) can be conveniently calculated from the prior and the data by simply using the formula below:

  $$
  \mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta \mathbf{X}^T \mathbf{X} \tag{6}
  $$
  
We see that the formula lends itself to the following intuition:

- The posterior precision $\mathbf{S}_N^{-1}$ is the sum of the prior-derived precision and data-derived precision.
- The prior precision $\mathbf{S}_0^{-1}$ reflects initial uncertainty in the weights $\mathbf{w}$.
- On the other hand, the data-derived precision $\beta \mathbf{X}^T \mathbf{X}$ reflects the improvement in precision coming from observing data $\mathbf{X}$, adjusted by the noise precision $\beta$.


Meanwhile, $\mathbf{m}_N$ is the mean of our posterior. It is calculated using the following formula:

  $$
  \mathbf{m}_N = \mathbf{S}_N \left( \mathbf{S}_0^{-1} \mathbf{m}_0 + \beta \mathbf{X}^T \mathbf{t} \right) \tag{7}
  $$

 We note that $\mathbf{m}_N$ is essentially a **weighted average** of the prior mean vector, $\mathbf{m}_0$, and the observed data (represented by $\mathbf{X}^T \mathbf{t}$). 
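
Equations (6) and (7) translate almost line by line into NumPy. The sketch below is an illustration under assumed toy data, with a hypothetical helper `conjugate_posterior`; it is not the implementation used inside `skpro`.

```python
import numpy as np


def conjugate_posterior(X, t, m0, S0, beta):
    """Return the posterior mean m_N and covariance S_N from Eqs. (6) and (7)."""
    S0_inv = np.linalg.inv(S0)
    SN_inv = S0_inv + beta * X.T @ X  # Eq. (6): prior precision + data precision
    SN = np.linalg.inv(SN_inv)
    mN = SN @ (S0_inv @ m0 + beta * X.T @ t)  # Eq. (7)
    return mN, SN


# Toy data (assumed for illustration): intercept column plus two features
rng = np.random.default_rng(0)
X_demo = np.column_stack([np.ones(50), rng.uniform(0, 1, size=(50, 2))])
t_demo = X_demo @ np.array([1.0, 2.0, 3.0]) + rng.normal(0, 0.5, size=50)

m0 = np.array([1.0, 4.0, 5.0])  # prior mean
S0 = np.eye(3)  # prior covariance
mN, SN = conjugate_posterior(X_demo, t_demo, m0, S0, beta=4.0)
print("posterior mean:", mN.round(3))
print("posterior variances:", np.diag(SN).round(4))
```

Because the data-derived precision is added to the prior precision, every posterior variance on the diagonal of $\mathbf{S}_N$ is smaller than the corresponding prior variance.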

Another observation: if we choose an infinitely broad (i.e. completely uninformative) prior, so that the prior precision $\mathbf{S}_0^{-1}$ is zero, the Bayesian posterior mean reduces to the frequentist $\mathbf{w}_{\text{MLE}}$ obtained through the normal equation:



$$
\mathbf{S}_N^{-1} = \mathbf{0} + \beta \mathbf{X}^T \mathbf{X}  \tag{Assuming $\mathbf{S}_0^{-1} = \mathbf{0}$}
$$

$$
\mathbf{S}_N = \left( \beta \mathbf{X}^T \mathbf{X} \right)^{-1} \tag{Inverting $\mathbf{S}_N^{-1}$}
$$

$$
\begin{align}
\mathbf{m}_N &= \left( \beta \mathbf{X}^T \mathbf{X} \right)^{-1} \beta \mathbf{X}^T \mathbf{t} \tag{Substituting $\mathbf{S}_N$ into $\mathbf{m}_N$} \\
&= \left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{t} \tag{Normal Equation recovered!} \\
&= \mathbf{w}_{\text{MLE}} \notag
\end{align}
$$
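
The reduction to the normal equation can be checked numerically. The following sketch, on assumed toy data, sets the prior precision to zero directly and compares the resulting posterior mean with an ordinary least-squares fit from `np.linalg.lstsq`.

```python
import numpy as np

# Toy data (assumed for illustration): intercept column plus two features
rng = np.random.default_rng(0)
X_demo = np.column_stack([np.ones(50), rng.uniform(0, 1, size=(50, 2))])
t_demo = X_demo @ np.array([1.0, 2.0, 3.0]) + rng.normal(0, 0.5, size=50)
beta = 4.0

prior_precision = np.zeros((3, 3))  # flat prior: S_0^{-1} = 0
SN_inv = prior_precision + beta * X_demo.T @ X_demo
# With zero prior precision, the prior-mean term of Eq. (7) vanishes
mN_flat = np.linalg.solve(SN_inv, beta * X_demo.T @ t_demo)

# Frequentist least-squares solution for comparison
w_mle, *_ = np.linalg.lstsq(X_demo, t_demo, rcond=None)
print("posterior mean (flat prior):", mN_flat.round(4))
print("least-squares estimate:     ", w_mle.round(4))
```

Note that $\beta$ cancels in the posterior mean, so the noise precision affects only the posterior covariance in this limit.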

## Posterior Predictive

Our ultimate goal is to get the posterior predictive distribution of a new target $t$ given a new input $\mathbf{x}$:

$$
p(t | \mathbf{x}, \mathbf{X}, \mathbf{t}) = \mathcal{N}(t | m(\mathbf{x}), s^2(\mathbf{x}))
$$

As the notation suggests, this distribution depends on the training data ($\mathbf{X}$ and $\mathbf{t}$) used to fit the model.

This posterior predictive distribution is a univariate Gaussian with mean $m(\mathbf{x})$ and variance $s^2(\mathbf{x})$, both of which depend on the given input $\mathbf{x}$.


### Predictive Mean

The predictive mean $m(\mathbf{x})$ is given by:
$$
\begin{aligned}
m(\mathbf{x}) &= \mathbf{x}^T \mathbf{S}_N \left( \mathbf{S}_0^{-1} \mathbf{m}_0 + \beta \mathbf{X}^T \mathbf{t} \right) \\
&= \mathbf{x}^T \mathbf{m}_N \tag{8}
\end{aligned}
$$

We note that this predictive mean is very simple: it is the inner product of the incoming data point $\mathbf{x}$ with the posterior mean $\mathbf{m}_N$.


### Predictive Variance

The predictive variance $s^2(\mathbf{x})$ is given by:
$$
s^2(\mathbf{x}) = \beta^{-1} + \mathbf{x}^T \mathbf{S}_N \mathbf{x} \tag{9}
$$


Predictive variance quantifies the model's uncertainty about a prediction. The first term, $\beta^{-1}$, is the irreducible noise in the data; the second term, $\mathbf{x}^T \mathbf{S}_N \mathbf{x}$, reflects the remaining uncertainty about $\mathbf{w}$ and depends on the **position** and **direction** of the incoming data point $\mathbf{x}$:

- Close to the training data, $\mathbf{x}^T \mathbf{S}_N \mathbf{x}$ is small and the model is confident.
- Far from the training data, $\mathbf{x}^T \mathbf{S}_N \mathbf{x}$ grows, reflecting increased uncertainty.
- For a fixed direction, a larger $\|\mathbf{x}\|$ increases $\mathbf{x}^T \mathbf{S}_N \mathbf{x}$, leading to higher variance.
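
As an illustration of Eqs. (8) and (9), the sketch below computes the predictive mean and variance for one input inside the training range and one far outside it. The helper `posterior_predictive` and the toy data are assumptions for this example, not `skpro` code.

```python
import numpy as np


def posterior_predictive(X_new, mN, SN, beta):
    """Predictive mean (Eq. 8) and variance (Eq. 9) for each row of X_new."""
    mean = X_new @ mN
    # Row-wise quadratic form x^T S_N x, plus the noise variance 1 / beta
    var = 1.0 / beta + np.einsum("nd,de,ne->n", X_new, SN, X_new)
    return mean, var


# Toy data (assumed for illustration): features drawn from [0, 1]
rng = np.random.default_rng(0)
X_demo = np.column_stack([np.ones(50), rng.uniform(0, 1, size=(50, 2))])
t_demo = X_demo @ np.array([1.0, 2.0, 3.0]) + rng.normal(0, 0.5, size=50)

# Posterior from Eqs. (6) and (7) with prior mean [1, 4, 5] and identity covariance
beta = 4.0
S0_inv = np.eye(3)
m0 = np.array([1.0, 4.0, 5.0])
SN = np.linalg.inv(S0_inv + beta * X_demo.T @ X_demo)
mN = SN @ (S0_inv @ m0 + beta * X_demo.T @ t_demo)

# First row lies inside the training range, second row lies far outside it
X_new = np.array([[1.0, 0.5, 0.5], [1.0, 5.0, 5.0]])
pred_mean, pred_var = posterior_predictive(X_new, mN, SN, beta)
print("predictive means:    ", pred_mean.round(3))
print("predictive variances:", pred_var.round(3))
```

The variance never drops below the noise floor $\beta^{-1}$, and it is noticeably larger for the input that lies far from the training data.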


# Application

The above framework is implemented by the `BayesianConjugateLinearRegressor` class from `skpro`. In this section, we'll take a look at its usage.

## Data

We will use the same synthetic dataset we generated in the first notebook.

As a reminder, the true parameter values are:
- Intercept $w_0$ = 1
- First regression coefficient $w_1$ = 2
- Second regression coefficient $w_2$ = 3
- Noise standard deviation $\sigma$ = 0.5, i.e. noise variance $\sigma^2$ = 0.25; in other words, the noise *precision* $\beta = 1/\sigma^2$ is 4. These values are assumed to be known.

In [None]:
train_data = pd.read_csv("train_data.csv", index_col=0)
X_train = train_data[["x1", "x2"]]
y_train = train_data["y_train"]

test_data = pd.read_csv("test_data.csv", index_col=0)
X_test = test_data[["x1", "x2"]]

style_data(train_data)

## Instantiation

To instantiate a `BayesianConjugateLinearRegressor` model, we need to define a multivariate Gaussian prior for the regression coefficients using the following parameters:

1. **`coefs_prior_mu`**: The prior mean vector ($\mathbf{m}_0$). <br>
Suppose that based on prior knowledge, we estimate that the regression coefficients are likely around 4 and 5. (Note: as mentioned earlier, the true values of $w_1$ and $w_2$ are actually $2$ and $3$, respectively, so this assumption is slightly off). For the intercept, we'll (correctly) assume it to be close to 1. Hence, we select the prior mean vector as follows; note that the first element of the vector is the intercept:
   $$
   \mathbf{m}_0 = 
   \begin{bmatrix}
   1 \\\\ 
   4 \\\\ 
   5
   \end{bmatrix}
   $$

2. **`coefs_prior_cov`** <br>
 The prior covariance matrix ($\mathbf{S}_0$). To keep the model simple, we'll assume that the intercept and coefficients are independent and have equal variance. This leads us to select an identity matrix for **`coefs_prior_cov`**, expressed as:

$$
\mathbf{S}_0 = 
\begin{bmatrix}
1 & 0 & 0 \\\\ 
0 & 1 & 0 \\\\
0 & 0 & 1
\end{bmatrix}
$$

Additionally, we need to specify **`noise_precision`**, the known precision ($\beta$) of the Gaussian noise in the data. For this example, we'll assume we know the true noise precision value which is $4$. 


In [None]:
# Prior mean m_0: the 1st value is the intercept,
# the 2nd and 3rd are the prior means for w1 and w2
COEFS_PRIOR_MU = np.array([[1.0], [4.0], [5.0]])
COEFS_PRIOR_COV = np.eye(3)  # Prior covariance S_0: identity matrix
NOISE_PRECISION = 4  # Known noise precision beta

model = BayesianConjugateLinearRegressor(
    coefs_prior_mu=COEFS_PRIOR_MU,
    coefs_prior_cov=COEFS_PRIOR_COV,
    noise_precision=NOISE_PRECISION,
)

Before fitting the model to our data, let's visualize the shape of our multivariate Gaussian prior for the coefficients $w_1$ and $w_2$.

In [None]:
COL_NAMES = ["Intercept", "w1", "w2"]
display(Math(r"\text{Mean of the prior for regression coefficients } (\mathbf{m}_0):"))
display(style_data(pd.DataFrame(COEFS_PRIOR_MU, columns=["Value"], index=COL_NAMES)))

display(
    Math(r"\text{Covariance of the prior for regression coefficients } (\mathbf{S}_0):")
)
display(style_data(pd.DataFrame(COEFS_PRIOR_COV, columns=COL_NAMES, index=COL_NAMES)))

TRUE_W1 = 2
TRUE_W2 = 3

# Plot the Gaussian distribution: Contour and 3D
fig = plt.figure(figsize=(14, 6))

# Generate a grid of points
x1, x2 = np.linspace(0, 8, 100), np.linspace(0, 8, 100)
X1, X2 = np.meshgrid(x1, x2)
pos = np.dstack((X1, X2))


def multivariate_gaussian(pos, mu, cov):
    # Helper function to compute the multivariate Gaussian PDF
    n = mu.shape[0]
    diff = pos - mu.flatten()
    inv_cov = np.linalg.inv(cov)
    exponent = -0.5 * np.einsum("...i,ij,...j->...", diff, inv_cov, diff)
    return (1.0 / np.sqrt((2 * np.pi) ** n * np.linalg.det(cov))) * np.exp(exponent)


# Extract the w1 and w2
Z = multivariate_gaussian(pos, COEFS_PRIOR_MU[1:], COEFS_PRIOR_COV[1:, 1:])

# Contour plot
ax1 = fig.add_subplot(121)
ax1.contour(X1, X2, Z, levels=10, cmap="viridis")
ax1.set_title("Contour Plot of the Bivariate Gaussian Prior")
ax1.set_xlabel("$w_1$")
ax1.set_ylabel("$w_2$")
ax1.plot(TRUE_W1, TRUE_W2, marker="*", color="red", markersize=10, label="True Value")
ax1.annotate(
    "True Value (2,3)",
    (TRUE_W1, TRUE_W2),
    textcoords="offset points",
    xytext=(5, -18),
    ha="center",
    color="red",
)
ax1.grid(True)

# 3D plot
ax2 = fig.add_subplot(122, projection="3d", box_aspect=(1, 1, 0.5))
ax2.plot_surface(X1, X2, Z, cmap="viridis", edgecolor="none", alpha=0.7)
ax2.scatter(TRUE_W1, TRUE_W2, 0, marker="*", color="red", s=100, label="True Value")
ax2.set_title("3D Plot of the Bivariate Gaussian Prior")
ax2.set_xlabel("$w_1$")
ax2.set_ylabel("$w_2$")
ax2.set_zticklabels([])

plt.tight_layout()
plt.show()

## Fitting

Like other estimators in the `skpro` family, the model is fitted easily with a single `.fit` call, which requires `X_train` and `y_train` as inputs.

If an intercept is required, a column of ones should be added to the feature matrix. This can be easily achieved using the `add_constant` function from the `statsmodels` library.

The `fit` method calculates the posterior using the conjugate formula elaborated above.

After fitting, the posterior becomes available through its parameters: `_coefs_posterior_mu` (mean) and `_coefs_posterior_cov` (covariance).

In [None]:
X_train_with_ones = sm.add_constant(
    X_train, prepend=True
)  # prepending a constant of ones
style_data(X_train_with_ones.head())

In [None]:
model.fit(X_train_with_ones, y_train)

As shown below, the posterior retains the same shape as the prior.

In [None]:
model._coefs_posterior_mu.shape == model._coefs_prior_mu.shape

In [None]:
# Access posterior mean and covariance
posterior_mu = model._coefs_posterior_mu.ravel()  # Flatten to match dimensions
posterior_cov = model._coefs_posterior_cov

display(Math(r"\text{Mean of the regression coefficients posterior } (\mathbf{m}_N):"))
display(
    style_data(
        pd.DataFrame(posterior_mu, columns=["Value"], index=["Intercept", "w1", "w2"])
    )
)

display(
    Math(r"\text{Covariance of the regression coefficients posterior } (\mathbf{S}_N):")
)
display(
    style_data(
        pd.DataFrame(
            posterior_cov,
            columns=["Intercept", "w1", "w2"],
            index=["Intercept", "w1", "w2"],
        )
    )
)

The visualization reveals that the posterior is narrower than the prior (indicating reduced uncertainty) and it is centered closer to the true value.

In [None]:
fig = plt.figure(figsize=(14, 6))

# Compute the posterior density of w1 and w2
Z = multivariate_gaussian(pos, posterior_mu[1:], posterior_cov[1:, 1:])

# Contour plot
ax1 = fig.add_subplot(121)
ax1.contour(X1, X2, Z, levels=10, cmap="viridis")
ax1.set_title("Contour Plot of the Bivariate Gaussian Posterior")
ax1.set_xlabel("$w_1$")
ax1.set_ylabel("$w_2$")
ax1.plot(TRUE_W1, TRUE_W2, marker="*", color="red", markersize=10, label="True Value")
ax1.annotate(
    "True Value (2, 3)",
    xy=(TRUE_W1, TRUE_W2),
    xytext=(50, 50),
    textcoords="offset points",
    ha="center",
    color="red",
    arrowprops=dict(arrowstyle="->", color="red", lw=1.5, alpha=0.5),
)
ax1.grid(True)

# 3D plot
ax2 = fig.add_subplot(122, projection="3d", box_aspect=(1, 1, 0.5))
ax2.plot_surface(X1, X2, Z, cmap="viridis", edgecolor="none", alpha=0.7)
ax2.scatter(TRUE_W1, TRUE_W2, 0, marker="*", color="red", s=100, label="True Value")
ax2.set_title("3D Plot of the Bivariate Gaussian Posterior")
ax2.set_xlabel("$w_1$")
ax2.set_ylabel("$w_2$")
ax2.set_zticklabels([])


plt.show()

## Update

One significant advantage of the Bayesian approach is its simplicity in handling updates, i.e., retraining the model when new data becomes available.

First, we'll generate additional synthetic training data.

As before, a column of ones must be prepended to the feature matrix to include the intercept.

In [None]:
# Generate 40 new random data points for X_train_update
np.random.seed(43)
N = 40
x1_new = np.random.uniform(0, 1, N)
x2_new = np.random.uniform(0, 1, N)

X_train_update = pd.DataFrame({"x1": x1_new, "x2": x2_new})
X_train_update_with_ones = sm.add_constant(
    X_train_update, prepend=True
)  # prepending a constant of ones

# Set the true relationship between the features and the target variable
TRUE_INTERCEPT = 1
TRUE_SLOPES = np.array([2, 3])
TRUE_SIGMA = 0.5

# Calculate the noise-free targets, then add independent Gaussian noise per point
y_true_new = TRUE_INTERCEPT + np.dot(X_train_update, TRUE_SLOPES)
y_train_update = pd.Series(y_true_new + np.random.normal(0, TRUE_SIGMA, size=N))

We'll then perform the update with a single call to the `update` method.

The `update` method applies the same conjugacy framework, treating the posterior from the previous training as the new prior and updating it with the additional data.
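
To see why reusing the old posterior as the new prior loses no information, here is a NumPy sketch of the underlying math on assumed toy data. It uses a hypothetical helper `conjugate_posterior` built from Eqs. (6) and (7) and is not a description of how `skpro` implements `update` internally.

```python
import numpy as np


def conjugate_posterior(X, t, m0, S0, beta):
    """Posterior mean and covariance from Eqs. (6) and (7)."""
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + beta * X.T @ X)
    mN = SN @ (S0_inv @ m0 + beta * X.T @ t)
    return mN, SN


# Toy data (assumed for illustration), split into an initial and an update batch
rng = np.random.default_rng(0)
beta = 4.0
X_all = np.column_stack([np.ones(90), rng.uniform(0, 1, size=(90, 2))])
t_all = X_all @ np.array([1.0, 2.0, 3.0]) + rng.normal(0, 0.5, size=90)
X_first, t_first = X_all[:50], t_all[:50]
X_second, t_second = X_all[50:], t_all[50:]

m0, S0 = np.array([1.0, 4.0, 5.0]), np.eye(3)

# Fit on the first batch, then reuse that posterior as the prior for the second
m1, S1 = conjugate_posterior(X_first, t_first, m0, S0, beta)
m_seq, S_seq = conjugate_posterior(X_second, t_second, m1, S1, beta)

# Fit once on all the data with the original prior
m_batch, S_batch = conjugate_posterior(X_all, t_all, m0, S0, beta)

print("sequential mean:", m_seq.round(4))
print("batch mean:     ", m_batch.round(4))
```

Up to floating-point error the two routes agree, because the precisions in Eq. (6) simply add up batch by batch.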

In [None]:
model.update(X_train_update_with_ones, y_train_update)

The visualization below shows that the posterior after the update becomes even narrower and moves even closer to the true value.

In [None]:
# Access posterior mean and covariance
posterior_mu = model._coefs_posterior_mu.ravel()  # Flatten to match dimensions
posterior_cov = model._coefs_posterior_cov

display(Math(r"\text{Mean of the regression coefficients posterior } (\mathbf{m}_N):"))
display(
    style_data(
        pd.DataFrame(posterior_mu, columns=["Value"], index=["Intercept", "w1", "w2"])
    )
)

display(
    Math(r"\text{Covariance of the regression coefficients posterior } (\mathbf{S}_N):")
)
display(
    style_data(
        pd.DataFrame(
            posterior_cov,
            columns=["Intercept", "w1", "w2"],
            index=["Intercept", "w1", "w2"],
        )
    )
)


# Plot the posterior Gaussian distribution: Contour and 3D
fig = plt.figure(figsize=(14, 6))

# Compute the posterior density of w1 and w2
Z = multivariate_gaussian(pos, posterior_mu[1:], posterior_cov[1:, 1:])

# Contour plot
ax1 = fig.add_subplot(121)
ax1.contour(X1, X2, Z, levels=10, cmap="viridis")
ax1.set_title("Contour Plot of the Bivariate Gaussian Posterior after Update")
ax1.set_xlabel("$w_1$")
ax1.set_ylabel("$w_2$")
ax1.plot(TRUE_W1, TRUE_W2, marker="*", color="red", markersize=10, label="True Value")
ax1.annotate(
    "True Value (2,3)",
    (TRUE_W1, TRUE_W2),
    textcoords="offset points",
    xytext=(3, -28),
    ha="center",
    color="red",
)
ax1.grid(True)

# 3D plot
ax2 = fig.add_subplot(122, projection="3d", box_aspect=(1, 1, 0.5))
ax2.plot_surface(X1, X2, Z, cmap="viridis", edgecolor="none", alpha=0.7)
ax2.scatter(TRUE_W1, TRUE_W2, 0, marker="*", color="red", s=100, label="True Value")
ax2.set_title("3D Plot of the Bivariate Gaussian Posterior after Update")
ax2.set_xlabel("$w_1$")
ax2.set_ylabel("$w_2$")
ax2.set_zticklabels([])

plt.show()

## (Probabilistic) Prediction

With the fitted model, we can now make predictions. Below is the test dataset, `X_test`, which will be used for generating predictions:

In [None]:
style_data(X_test)

Probabilistic predictions are made using the `predict_proba` method, which requires `X_test` as inputs.

In [None]:
X_test_with_ones = sm.add_constant(
    X_test, prepend=True
)  # prepending a constant of ones
y_test_pred_proba = model.predict_proba(X_test_with_ones)
y_test_pred_proba

### Plotting posterior predictive PDF

The prediction output, `y_test_pred_proba`, is an instance of `skpro`'s `Normal` distribution, containing the same number of data points as `X_test`. 

A key advantage of this Bayesian estimator is that each prediction is a Normal distribution, making it straightforward to work with. 

For example, we can easily plot the probability density function of the predictions using the `plot` method.

In [None]:
_ = y_test_pred_proba.plot("pdf")

### Predictive Credible Intervals

One great thing about using probabilities in predictions is how easily we can calculate **predictive credible intervals**. These are the Bayesian version of confidence intervals in frequentist statistics.

A credible interval gives you a range in which the target value is likely to fall, at a specified probability level. For instance, a 95% predictive credible interval means that, under the model, there’s a 95% chance that the target will fall within that range.

The key difference from confidence intervals is how they’re interpreted. Confidence intervals are about long-run frequencies: how often the interval would contain the true value if you repeated the experiment many times. Credible intervals, on the other hand, are more directly interpretable: they tell you the probability of the target being in that range for the case you’re looking at right now.
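
Because each prediction is Gaussian, a credible interval follows directly from its mean and variance. The sketch below uses only the Python standard library; the helper `central_interval` and the numbers are hypothetical, and it assumes a central (equal-tailed) interval, which may or may not match the convention of a particular library.

```python
from statistics import NormalDist


def central_interval(mean, variance, coverage):
    """Central interval of a Normal(mean, variance) holding `coverage` probability."""
    dist = NormalDist(mu=mean, sigma=variance**0.5)
    tail = (1.0 - coverage) / 2.0  # probability mass left out on each side
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)


# Hypothetical predictive mean and variance for a single test point
lower, upper = central_interval(mean=3.5, variance=0.26, coverage=0.8)
print(f"80% credible interval: [{lower:.3f}, {upper:.3f}]")
```

A higher coverage widens the interval, and a larger predictive variance does the same.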

The credible intervals are obtained using the model's `predict_interval` method, as shown below:

In [None]:
predictive_credible_interval = model.predict_interval(X_test_with_ones, coverage=0.8)
style_data(predictive_credible_interval)

### Point predictions

To obtain point predictions, we can use the `predict` method instead. This returns the centre of each posterior predictive distribution described above; because these distributions are Gaussian, the mean and the median coincide.

In [None]:
y_test_pred = model.predict(X_test_with_ones)
style_data(y_test_pred)

## Advantages 

The main advantage of conjugate prior frameworks such as the one we saw above is their **analytical simplicity**. When a conjugate prior is used, the posterior distribution can be found analytically. This eliminates the need for computationally intensive methods such as MCMC. This simplicity and the availability of closed-form solutions also extend to the derivation of posterior predictive distributions.

Another key benefit is **computational efficiency**. The availability of a closed-form solution for the posterior distribution makes the process fast. 

Finally, conjugate priors enhance **interpretability**. Since the prior, likelihood, and posterior share the same functional form, it is easier to see how prior beliefs combine with observed data to produce the posterior distribution. This transparency allows for deeper insights into how the data and prior assumptions influence the final results.

## Disadvantages

Conjugate priors suffer from **limited flexibility**. They impose constraints on the form of the prior distribution, and the chosen prior may not accurately reflect true prior knowledge about the parameters. To reiterate: conjugacy only holds for specific likelihood-prior pairings, such as the Gaussian likelihood and Gaussian prior used here. If the prior that best expresses our beliefs is not conjugate to the likelihood, the conjugate strategy cannot be applied.

Additionally, conjugate priors are **inadequate for complex models**. Modern Bayesian models, such as neural networks or probabilistic graphical models, frequently require non-conjugate priors to capture intricate relationships between variables. In such cases, other inference techniques like MCMC or Variational Inference (VI) are necessary to approximate the posterior.

## Conclusion

While conjugate priors excel in providing analytical tractability and computational efficiency, they are inherently rigid. They are most effective for simple, structured models or situations where interpretability and fast computations are prioritized. 

# References

- [Bishop - Pattern Recognition and Machine Learning (2006)](https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf) - A comprehensive reference on machine learning theory and Bayesian methods by Christopher M. Bishop.