# Objective

This series of notebooks offers an in-depth exploration of **Bayesian Linear Regression**. 
It includes a comparison with the frequentist approach to linear regression and an introduction to the most common Bayesian inference techniques.

The notebooks are organized as follows:

1. **Foundations of Linear Regression**  
   The first notebook (this one) lays the groundwork for understanding linear regression. It introduces the mathematical framework and provides an overview of frequentist approaches to estimate the model weights. This serves as a stepping stone for transitioning into the Bayesian perspective.


2. **Conjugate Prior in Bayesian Inference**  
   The second notebook delves into the concept of conjugate priors. By using conjugate distributions, we can derive posterior distributions analytically, simplifying Bayesian inference. This notebook also highlights how prior knowledge can influence the model and improve predictions in the presence of limited data.


3. **MCMC and Variational Inference**  
   The third and fourth notebooks introduce advanced methods for Bayesian inference, including Markov Chain Monte Carlo (MCMC) and Variational Inference. These techniques enable approximate inference when analytical solutions are infeasible, making it possible to handle complex models and large datasets effectively.

These notebooks are heavily inspired by the theories and notations presented in Christopher M. Bishop's classic textbook *[Pattern Recognition and Machine Learning](https://www.microsoft.com/en-us/research/people/cmbishop/prml-book/)*. 

# Imports and Helper Function

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
from IPython.display import Math, display
from sklearn.model_selection import train_test_split
from utils import style_data

In [None]:
%load_ext autoreload
%autoreload 2

# Theory of Linear Regression

## Definition

Linear regression is a widely used model due to its simplicity and interpretability. 


In its simplest form, it predicts a single target $t$ as the deterministic output of the function $y$, which in turn is a linear combination of input variables $\mathbf{x} = (x_1, \dots, x_D)^\top$ and parameters $\mathbf{w} = (w_0, w_1, \dots, w_D)^\top$:

$$
t = y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{j=1}^D w_j x_j \tag{1}
$$


The parameter $w_0$, which is often called "bias" or "intercept", allows for any fixed offset in the data. It is often convenient to define an additional dummy feature $x_0 = 1$ so that we can simplify the above equation as:

$$
t =  y(\mathbf{x}, \mathbf{w}) = \mathbf{w}^\top \mathbf{x} \tag{2}
$$
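As a quick numerical check that (1) and (2) describe the same function, the sketch below evaluates both forms. The weights and inputs are made-up values chosen only for illustration; they are not the dataset used later in this notebook.

```python
import numpy as np

# Illustrative values only (assumed for this sketch)
w = np.array([1.0, 2.0, 3.0])  # (w_0, w_1, w_2)
x = np.array([0.5, -1.0])      # (x_1, x_2)

# Equation (1): explicit bias plus weighted sum of the features
y_explicit = w[0] + np.sum(w[1:] * x)

# Equation (2): prepend the dummy feature x_0 = 1 and take a dot product
x_aug = np.concatenate(([1.0], x))
y_dot = w @ x_aug

print(y_explicit, y_dot)
```

Both expressions return the same number, which is why the rest of the notebook can work with the compact form $\mathbf{w}^\top \mathbf{x}$.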



## Likelihood

### Likelihood of a single target data point $t$

**Likelihood measures how well a statistical model explains the observed data, given its parameters.** Put another way, it expresses how probable the observed data are under the model’s assumed data-generating process.

Let’s revisit our earlier example. Initially, we assumed that the target $t$ is a deterministic output of $y(\mathbf{x}, \mathbf{w})$, the model's prediction based on the input features $\mathbf{x}$ and the regression coefficients $\mathbf{w}$. 

While this works in theory, it doesn’t account for the inherent uncertainty in real-world data.
To address this, we shall introduce uncertainty by assuming that $t$ is generated as the deterministic output $y(\mathbf{x}, \mathbf{w})$ combined with additive Gaussian noise $\epsilon$:

$$
\begin{aligned}
t &= y(\mathbf{x}, \mathbf{w}) + \epsilon \\
&= \mathbf{w}^\top \mathbf{x} + \epsilon \tag{3}
\end{aligned}
$$

where $\epsilon \sim \mathcal{N}(0, \beta^{-1})$ represents noise with zero mean and variance $\beta^{-1}$ (i.e. $\beta$ is the *precision* of the Gaussian).


Alternatively, we can express  $t$ probabilistically by saying that it follows a Gaussian distribution, centered at $y(\mathbf{x}, \mathbf{w})$, with variance $\beta^{-1}$:

$$
\begin{aligned}
p(t | \mathbf{x}, \mathbf{w}, \beta) &= \mathcal{N}(t | y(\mathbf{x}, \mathbf{w}), \beta^{-1}) \\
&= \mathcal{N}(t | \mathbf{w}^\top \mathbf{x}, \beta^{-1}) \tag{4}
\end{aligned}
$$
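To make (4) concrete, the following minimal sketch evaluates the single-point likelihood for two candidate targets. The helper `gaussian_pdf` and all numerical values are assumptions introduced for this illustration.

```python
import numpy as np

def gaussian_pdf(t, mean, beta):
    """Density of N(t | mean, 1/beta), where beta is the precision."""
    return np.sqrt(beta / (2.0 * np.pi)) * np.exp(-0.5 * beta * (t - mean) ** 2)

# Illustrative values only (assumed for this sketch)
w = np.array([1.0, 2.0, 3.0])
x = np.array([1.0, 0.5, -1.0])  # includes the dummy feature x_0 = 1
beta = 4.0                      # precision, i.e. a noise variance of 0.25

mean = w @ x                                     # model prediction y(x, w)
lik_at_mean = gaussian_pdf(mean, mean, beta)     # target equal to the prediction
lik_far = gaussian_pdf(mean + 1.0, mean, beta)   # target one unit away

print(lik_at_mean, lik_far)
```

A target that coincides with the prediction receives the highest density, and the density falls off as the target moves away, at a rate controlled by $\beta$.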


### Likelihood of a set of data points $\mathbf{t}$


Now let's consider a data set of inputs $ \mathbf{X} = \{ \mathbf{x}_1, \ldots, \mathbf{x}_N \} $ with corresponding target values $ t_1, \ldots, t_N $. 

First, we group the scalar target variables $\{ t_n \}$ into a column vector that we denote by $\mathbf{t}$. 

Afterwards, we assume that the data points in $\mathbf{t}$ are drawn independently from the above single-sample likelihood distribution. With this assumption, we proceed to construct the following expression for the likelihood function of the *entire dataset* $\mathbf{t}$:

$$
\begin{aligned}
p(\mathbf{t} | \mathbf{X}, \mathbf{w}, \beta) &= \prod_{n=1}^N \mathcal{N}(t_n | \mathbf{w}^\top \mathbf{x}_n, \beta^{-1}) \\
&\propto \exp \left( -\frac{\beta}{2} \sum_{n=1}^N (t_n - \mathbf{w}^\top \mathbf{x}_n)^2 \right)
\\
&\propto \exp \left( -\frac{\beta}{2} \|\mathbf{t} - \mathbf{X} \mathbf{w}\|^2 \right)
\end{aligned}  \tag{5}
$$


Note: since the data matrix $\mathbf{X} $ always appears in the set of conditioning variables, from this point onwards, we will drop the explicit "$\mathbf{X}$" from expressions such as $ p(\mathbf{t} | \mathbf{X}, \mathbf{w}, \beta) $ to keep the notation uncluttered.


### Log-likelihood


To better analyze the likelihood function, we often work with its *logarithm*. Taking the logarithm simplifies the product of probabilities into a sum, thus simplifying calculations and avoiding numerical underflow.

The **log-likelihood** of our model is given by:

$$
\begin{aligned}
\ln p(\mathbf{t} | \mathbf{w}, \beta) 
&= \sum_{n=1}^N \ln \mathcal{N}(t_n | \mathbf{w}^\top \mathbf{x}_n, \beta^{-1}) \\
&= \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \beta E_D(\mathbf{w}) 
\end{aligned}\tag{6}
$$

The last component of the equation $E_D(\mathbf{w})$ is an important one - it is the **sum of squares error function** and is defined as:

$$
E_D(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^N \big(t_n - \mathbf{w}^\top \mathbf{x}_n\big)^2. \tag{7}
$$

The log-likelihood above measures how well the model parameters $\mathbf{w}$ explain the observed data. 

To optimize a model, we should aim to **maximize the log-likelihood**. From **(6)**, we observe that the only component of the log-likelihood that depends on the parameters $\mathbf{w}$ is the sum of squared errors $E_D(\mathbf{w})$. Consequently, this **likelihood maximization** is mathematically equivalent to minimizing $E_D(\mathbf{w})$—that is, reducing the discrepancy between the observed target values $t_n$ and the predicted values $\mathbf{w}^\top \mathbf{x}_n$ as much as possible.

This connection explains why, for linear regression with Gaussian noise, **Maximum Likelihood Estimation (MLE)** yields the same weights as **Ordinary Least Squares (OLS)**. 
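The two lines of (6) can be checked against each other numerically. The sketch below is a minimal illustration on a small synthetic problem; the data, the seed, and the function names `log_lik_sum` and `log_lik_closed` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small synthetic problem (all values assumed for illustration)
N, beta = 50, 4.0
X = np.c_[np.ones(N), rng.uniform(0, 1, size=(N, 2))]  # rows are x_n with x_0 = 1
w = np.array([1.0, 2.0, 3.0])
t = X @ w + rng.normal(0.0, beta ** -0.5, size=N)

def log_lik_sum(w, X, t, beta):
    """First line of (6): sum of per-point Gaussian log-densities."""
    resid = t - X @ w
    return np.sum(0.5 * np.log(beta) - 0.5 * np.log(2 * np.pi) - 0.5 * beta * resid ** 2)

def log_lik_closed(w, X, t, beta):
    """Second line of (6), written in terms of E_D(w) from (7)."""
    E_D = 0.5 * np.sum((t - X @ w) ** 2)
    n = len(t)
    return 0.5 * n * np.log(beta) - 0.5 * n * np.log(2 * np.pi) - beta * E_D

a = log_lik_sum(w, X, t, beta)
b = log_lik_closed(w, X, t, beta)
# Shifting every weight by 1 increases E_D here, so the log-likelihood drops
c = log_lik_closed(w + 1.0, X, t, beta)
print(a, b, c)
```

The first two values agree, and the third is lower: moving the weights away from values that fit the data increases $E_D(\mathbf{w})$ and therefore decreases the log-likelihood.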

## Frequentist Approach to Obtaining $\mathbf{w}$


To maximize the log-likelihood in the **frequentist framework**, we compute its gradient with respect to $\mathbf{w}$:

$$
\nabla \ln p(\mathbf{t} | \mathbf{w}, \beta) = \beta \sum_{n=1}^N \big(t_n - \mathbf{w}^\top \mathbf{x}_n \big) \mathbf{x}_n^\top. \tag{8}
$$

Setting this gradient to zero (the constant factor $\beta > 0$ drops out) gives the optimal weights $\mathbf{w}$ that minimize the sum-of-squares error:

$$
0 = \sum_{n=1}^N t_n \mathbf{x}_n^\top - \mathbf{w}^\top \sum_{n=1}^N \mathbf{x}_n \mathbf{x}_n^\top. \tag{9}
$$

Solving for $\mathbf{w}$, we obtain the famous **normal equation**:

$$
\mathbf{w}_{\text{MLE}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{t}. \tag{10}
$$

The term $(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top$ is the **Moore-Penrose pseudoinverse** of $\mathbf{X}$ for the case where $\mathbf{X}$ has linearly independent columns, so that $\mathbf{X}^\top \mathbf{X}$ is invertible. It generalizes the standard matrix inverse to matrices that are not square: when $\mathbf{X}$ is square and invertible, it reduces to $\mathbf{X}^{-1}$. For an over-determined system (more data points than parameters), it yields the least-squares solution, i.e. the weights whose predictions $\mathbf{X}\mathbf{w}$ are the orthogonal projection of $\mathbf{t}$ onto the column space of $\mathbf{X}$.
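As a hedged illustration of the normal equation, the sketch below computes the weights in three equivalent ways on a toy problem and checks that the gradient of the log-likelihood vanishes at the solution. The data and seed are assumptions made for this example; `np.linalg.solve`, `np.linalg.pinv` and `np.linalg.lstsq` are standard NumPy routines.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy design matrix and targets (assumed for illustration)
N = 40
X = np.c_[np.ones(N), rng.uniform(0, 1, size=(N, 2))]
t = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(0.0, 0.5, size=N)

# Normal equation, solved as a linear system instead of forming an explicit inverse
w_normal = np.linalg.solve(X.T @ X, X.T @ t)

# Same estimate via the pseudoinverse and via a least-squares solver
w_pinv = np.linalg.pinv(X) @ t
w_lstsq, *_ = np.linalg.lstsq(X, t, rcond=None)

# Gradient of the log-likelihood (up to a positive constant factor) at the solution
grad = X.T @ (t - X @ w_normal)

print(w_normal, w_pinv, w_lstsq, grad)
```

Solving the linear system directly is generally preferred over computing $(\mathbf{X}^\top \mathbf{X})^{-1}$ explicitly, since it is cheaper and numerically more stable when features are strongly correlated.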

This *frequentist* approach provides a closed-form solution for the maximum likelihood estimate of the regression weights, $\mathbf{w}_{\text{MLE}}$. While straightforward, it has key limitations:
- The estimate $\mathbf{w}_{\text{MLE}}$ is fixed and does not account for uncertainty in the parameters.
- It cannot incorporate prior knowledge, which becomes particularly problematic when data is scarce.

# Synthetic Data Generation

We will now demonstrate the above theory by fitting a linear regression on synthetic data. The synthetic inputs $\mathbf{x}$ have just two features ($x_1$ and $x_2$) and 25 data points. The true relationship between the inputs $\mathbf{x}$ and the target variable $y_{\text{true}}$ is given by the equation:


$$
\begin{equation}
y_{\text{true}} = w_0^{\text{true}} + x_1 \cdot w_1^{\text{true}} + x_2 \cdot w_2^{\text{true}} \tag{11}
\end{equation}
$$


where $w_0^{\text{true}} = 1$, $w_1^{\text{true}} = 2$, and $ w_2^{\text{true}} = 3$.

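The observed target values ($y_{\text{train}}$) are generated by adding Gaussian noise to the true target values:

$$
\begin{equation}
y_{\text{train}} = y_{\text{true}} + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma_{\text{true}}^2) \tag{12}
\end{equation}
$$

Here, $\sigma_{\text{true}} = 0.5$ is the standard deviation of the noise, which corresponds to a precision of $\beta = 1 / \sigma_{\text{true}}^2 = 4$.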


In [None]:
# Creating 25 random data points containing 2 features
N = 25
np.random.seed(42)
x1 = np.random.uniform(0, 1, N)
x2 = np.random.uniform(0, 1, N)
X = pd.DataFrame({"x1": x1, "x2": x2})

# Set the true relationship between the features and the target variable
TRUE_INTERCEPT = 1
TRUE_SLOPES = np.array([2, 3])
TRUE_SIGMA = 0.5

# Calculating the target variables:
# y_true (deterministic) and y_train (includes Gaussian noise)
y_true = TRUE_INTERCEPT + np.dot(X, TRUE_SLOPES)
y_train = y_true + np.random.normal(0, TRUE_SIGMA, size=len(X))

# Combine everything into a single DataFrame
data = pd.concat(
    [X, pd.Series(y_true, name="y_true"), pd.Series(y_train, name="y_train")],
    axis=1,
)
data = data.reset_index(drop=True)

# train test split and saving
train_data, test_data = train_test_split(data, test_size=5)
train_data.to_csv("train_data.csv")
test_data.to_csv("test_data.csv")

# Display the train_data DataFrame
style_data(train_data.head())

The plots below show the relationship between each feature ($x_1$, $x_2$) and the targets: the theoretical $y_{\text{true}}$, computed with the other feature held at its mean, is drawn as a line, and the observed $y_{\text{train}}$, which contains Gaussian noise, is drawn as dots.

In [None]:
# Fix feature1 and feature2 constants
x1_constant = train_data["x1"].mean()
x2_constant = train_data["x2"].mean()

# Recalculate the true target `y_true` for a constant x1
y_true_fixed_x1 = (
    TRUE_INTERCEPT + TRUE_SLOPES[0] * x1_constant + TRUE_SLOPES[1] * train_data["x2"]
)

# Recalculate the true target `y_true` for a constant x2
y_true_fixed_x2 = (
    TRUE_INTERCEPT + TRUE_SLOPES[0] * train_data["x1"] + TRUE_SLOPES[1] * x2_constant
)

# Set up the plot
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Plot feature1 vs. y_train with x2 constant
axes[0].scatter(
    train_data["x1"],
    train_data["y_train"],
    label="Observed `y_train` (containing noise)",
    alpha=0.6,
)
axes[0].plot(
    train_data["x1"],
    y_true_fixed_x2,
    color="red",
    label="Theoretical `y_true`",
    linewidth=2,
)
axes[0].set_xlabel("x1")
axes[0].set_ylabel("target")
axes[0].set_title(f"x1 vs target\n(x2 fixed at {x2_constant:.2f})")
axes[0].legend()

# Plot feature2 vs. y_train with x1 constant
axes[1].scatter(
    train_data["x2"],
    train_data["y_train"],
    label="Observed `y_train` (containing noise)",
    alpha=0.6,
)
axes[1].plot(
    train_data["x2"],
    y_true_fixed_x1,
    color="red",
    label="Theoretical `y_true`",
    linewidth=2,
)
axes[1].set_xlabel("x2")
axes[1].set_title(f"x2 vs target\n(x1 fixed at {x1_constant:.2f})")
axes[1].legend()

# Improve spacing and show plot
plt.tight_layout()
plt.show()

We will evaluate the models' performance on held-out **testing** data. These are the 5 data points that `train_test_split` set aside in the data generation step above.

In [None]:
X_test = test_data[["x1", "x2"]]
style_data(X_test)

# Coding OLS Linear Regression

## From Scratch (Normal Equation)

Having generated the synthetic data, we are now ready to solve the problem.  

To begin, we will solve the linear regression problem "from scratch" by applying the normal equation to compute $\mathbf{w}_{\text{MLE}}$, the maximum likelihood estimate of the weights.

In [None]:
X_train = train_data[["x1", "x2"]]
y_train = train_data["y_train"]

# For simplicity, we'll use `X` and `y` to represent the final forms X_train and y_train
X = np.c_[np.ones(len(X_train)), X_train]  # Add a column of ones to the features
y = y_train.values.reshape(-1, 1)  # Reshape y to a column vector

# Applying the normal equation
X_pseudo_inverse = np.linalg.inv(X.T @ X) @ X.T
weights = X_pseudo_inverse @ y

# Extracting the intercept and slopes
intercept = weights[0, 0]
slopes = weights[1:].flatten()

# Calculating residuals and their standard deviation
y_pred = X @ weights
residuals = y - y_pred
estimated_sigma = residuals.std()


# ================== Reporting ==================

true_model_latex = rf"""
\text{{True data generating model:}} \\
y_{{\text{{true}}}} = {TRUE_SLOPES[0]:.2f} \cdot x_1 +
{TRUE_SLOPES[1]:.2f} \cdot x_2 + {TRUE_INTERCEPT:.2f} \\
\text{{True standard deviation: }} \sigma_{{\text{{true}}}} = {TRUE_SIGMA:.2f}\\
"""


estimated_model_latex_normal = rf"""
\text{{Estimated MLE model (using Normal equation):}} \\
\hat{{y}} = {slopes[0]:.2f} \cdot x_1 + {slopes[1]:.2f} \cdot x_2 + {intercept:.2f} \\
\text{{Standard deviation of residuals: }} \hat{{\sigma}} = {estimated_sigma:.2f}
"""

display(Math(true_model_latex))
display(Math(estimated_model_latex_normal))

We see that the estimated $\mathbf{w}_{\text{MLE}}$ is not too far off from the true data generating model.
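The standard deviation of the residuals also has a maximum likelihood interpretation. As a sketch of a step this notebook does not derive (it follows the standard treatment in Bishop), setting the derivative of (6) with respect to $\beta$ to zero at $\mathbf{w} = \mathbf{w}_{\text{MLE}}$ gives:

$$
\frac{\partial}{\partial \beta} \ln p(\mathbf{t} | \mathbf{w}_{\text{MLE}}, \beta)
= \frac{N}{2\beta} - E_D(\mathbf{w}_{\text{MLE}}) = 0
\quad \Longrightarrow \quad
\frac{1}{\beta_{\text{MLE}}} = \frac{1}{N} \sum_{n=1}^N \big(t_n - \mathbf{w}_{\text{MLE}}^\top \mathbf{x}_n\big)^2
$$

In other words, the maximum likelihood estimate of the noise variance is the mean squared residual. Because the model includes an intercept, the residuals average to zero, so this equals the squared standard deviation of the residuals as computed by NumPy's `std()` with its default settings.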

## Using `statsmodels`

Manually coding the Normal Equation every time we solve a linear regression problem can quickly become tedious and error-prone.  

Thankfully, the **`statsmodels`** library simplifies this process, allowing us to fit an Ordinary Least Squares (OLS) model in just two lines of code.

In [None]:
X_train_with_ones = sm.add_constant(X_train)
ols_model = sm.OLS(y_train, X_train_with_ones).fit()

The code below extracts the parameter estimates from the `ols_model`.

Notice that the estimated slopes, intercept, and standard deviation closely match those derived from the Normal equation. This is expected since **`statsmodels`** solves the same least-squares problem behind the scenes (by default via the pseudoinverse of the design matrix).

In [None]:
# Predicted values for y_train
y_train_pred = ols_model.predict(X_train_with_ones)
residuals = y_train - y_train_pred  # observed minus predicted

# ================== Reporting ==================

true_model_latex = rf"""
\text{{True data generating model:}} \\
y_{{\text{{true}}}} = {TRUE_SLOPES[0]:.2f} \cdot x_1 +
{TRUE_SLOPES[1]:.2f} \cdot x_2 + {TRUE_INTERCEPT:.2f} \\
\text{{True standard deviation: }} \sigma = {TRUE_SIGMA:.2f}\\
"""

estimated_model_latex = rf"""
\text{{Estimated MLE model:}} \\
\hat{{y}} = {ols_model.params.iloc[1]:.2f} \cdot x_1 +
{ols_model.params.iloc[2]:.2f} \cdot x_2 + {ols_model.params.iloc[0]:.2f} \\
\text{{Standard deviation of residuals: }} \hat{{\sigma}} = {residuals.std(ddof=0):.2f}\\
"""

# Displaying the results using LaTeX
display(Math(true_model_latex))
display(Math(estimated_model_latex))
display(Math(estimated_model_latex_normal))

## Test Prediction


Using the trained `ols_model`, we can also create point predictions on the unseen `X_test` along with the corresponding confidence interval.

The latter provides a range within which we expect the true mean response to lie at a chosen confidence level (e.g., 95%).

In this code, we fix **`x2`** at its mean value to isolate and visualize the influence of **`x1`** on the target variable. 



In [None]:
# Fix feature2 at its mean
x2_constant = X_test["x2"].mean()

# Create a new test dataset with feature2 fixed at the constant value
X_test_fixed_x2 = X_test.copy()
X_test_fixed_x2["x2"] = x2_constant

# Predict y_test using the linear model after adding constant
predictions_fixed_x2 = ols_model.get_prediction(
    sm.add_constant(X_test_fixed_x2, has_constant="add")
)
pred_summary_fixed_x2 = predictions_fixed_x2.summary_frame(alpha=0.05)

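# Extract predicted values and the confidence interval of the mean prediction
# (the `obs_ci_*` columns would instead give the wider prediction interval
# for individual new observations)
y_test_pred_fixed_x2 = pred_summary_fixed_x2["mean"]
conf_int_lower_fixed_x2 = pred_summary_fixed_x2["mean_ci_lower"]
conf_int_upper_fixed_x2 = pred_summary_fixed_x2["mean_ci_upper"]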

sorted_indices = np.argsort(X_test["x1"])
X_test_sorted = X_test["x1"].iloc[sorted_indices]
y_test_pred_sorted = y_test_pred_fixed_x2.iloc[sorted_indices]
conf_int_lower_sorted = conf_int_lower_fixed_x2.iloc[sorted_indices]
conf_int_upper_sorted = conf_int_upper_fixed_x2.iloc[sorted_indices]

# Plot the predictions with the confidence intervals
plt.figure(figsize=(10, 6))
plt.scatter(X_test_sorted, y_test_pred_sorted, color="blue", label="Predicted values")
plt.fill_between(
    X_test_sorted,
    conf_int_lower_sorted,
    conf_int_upper_sorted,
    color="lightblue",
    alpha=0.4,
    label="95% Confidence Interval",
)
plt.xlabel("x1")
plt.ylabel("Predicted Target")
plt.title(f"Predictions for X_test \n(x2 fixed at {x2_constant:.2f})")
plt.legend()
plt.show()

## Confidence Interval

The prediction above illustrates the frequentist concept of a **confidence interval (CI)**. A confidence interval provides a range of plausible values for an unknown quantity (such as a regression coefficient or the mean response) based on the observed data. For example, a 95% CI means that if we were to repeat the same experiment or sampling process many times, approximately 95% of the intervals constructed from those experiments would contain the true value of the parameter.

It’s important to note that a CI does not indicate the probability that the true parameter lies within the interval for a single sample. This distinction often leads to confusion, as people sometimes interpret CIs in a probabilistic way that is closer to Bayesian reasoning.

We will see later that the Bayesian counterpart of the CI, the **Bayesian credible interval**, does in contrast provide a probabilistic statement about the parameter itself. For instance, a 95% Bayesian credible interval directly states that there is a 95% probability that the true parameter value lies within the interval, conditioned on the data and the prior. 
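The repeated-sampling interpretation of a confidence interval can be illustrated with a small simulation. The sketch below is a simplified setting chosen for this example, not the model fitted above: a single feature, a known noise standard deviation, and a normal-based interval for the slope. All numerical values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: t = w0 + w1 * x + noise, with a known noise standard deviation
w0_true, w1_true, sigma = 1.0, 2.0, 0.5
N, n_experiments = 25, 2000
z = 1.96  # approximate 97.5% quantile of the standard normal

x = rng.uniform(0, 1, N)  # inputs are held fixed across experiments
X = np.c_[np.ones(N), x]
# Standard error of the slope estimate when sigma is known
se_slope = sigma * np.sqrt(np.linalg.inv(X.T @ X)[1, 1])

covered = 0
for _ in range(n_experiments):
    t = w0_true + w1_true * x + rng.normal(0.0, sigma, N)  # fresh noisy dataset
    w_hat = np.linalg.solve(X.T @ X, X.T @ t)
    lower, upper = w_hat[1] - z * se_slope, w_hat[1] + z * se_slope
    covered += int(lower <= w1_true <= upper)

coverage = covered / n_experiments
print(f"Fraction of intervals containing the true slope: {coverage:.3f}")
```

The printed fraction should be close to 0.95. Each individual interval either contains the true slope or it does not; the 95% refers to the long-run behaviour of the procedure.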

# Bayesian Linear Regression

Now that we have a solid foundation in linear regression and the frequentist approach using Maximum Likelihood Estimation (MLE), let us shift our focus to Bayesian linear regression. Bayesian linear regression estimates parameters using two sources of information: prior beliefs on those parameters and observed data.




We will see that compared to OLS regression, Bayesian Linear Regression offers several key advantages:

1. **Incorporation of Prior Knowledge**: Bayesian regression allows you to incorporate prior beliefs about parameters, which can improve estimates, especially in cases where data is sparse.

2. **Uncertainty Quantification**: Unlike the frequentist approach which provides only single point estimates for model parameters, Bayesian linear regression outputs entire probability distributions. This allows for a richer understanding of parameter uncertainty and facilitates probabilistic predictions.


In this section, we will explore the theoretical framework used in Bayesian linear regression.


## Theory

Bayesian linear regression directly applies Bayes' Theorem to estimate the $\color{green}{\textbf{posterior distributions}}$ of the model parameters. 

As a reminder, here is Bayes' Theorem:

$$
\begin{align}
\color{green}{P(\mathbf{w} \mid \mathbf{t})} &= \frac{\color{blue}{P(\mathbf{t} \mid  \mathbf{w})} \times \color{orange}{P(\mathbf{w})}}{\color{purple}{P(\mathbf{t})}} \\
\color{green}{\text{posterior}} &= \frac{\color{blue}{\text{likelihood}} \times \color{orange}{\text{prior}}}{\color{purple}{\text{marginal likelihood}}}
\end{align}
\tag{13}
$$ 

Where:

- $\color{orange}{P(\mathbf{w})}$ is the $\color{orange}{\textbf{prior}}$ for parameters $\mathbf{w}$, reflecting our beliefs about $\mathbf{w}$ before observing the data $(\mathbf{t}, \mathbf{X})$. For instance, if we assume most predictors should have little influence, we can set $\color{orange}{P(\mathbf{w})}$ to be a Gaussian centered at zero with small standard deviation, which represents our high certainty that their values are close to zero. Technically, we can also have $\color{orange}{P(\beta)}$ - the prior for the precision parameter $\beta$ (inverse variance of the noise). However, to simplify our discussion, we'll assume $\beta$ is known, so this term is constant and can be ignored.  


- $\color{green}{P(\mathbf{w} \mid \mathbf{t})}$ represents the $\color{green}{\textbf{posterior}}$ for $\mathbf{w}$ given the observed data $(\mathbf{t}, \mathbf{X})$. It combines the prior and the likelihood and quantifies our belief about $\mathbf{w}$ *after* observing the data. For example, if we initially believe a slope $w_1$ should be small due to prior domain knowledge but the data $\mathbf{X}$ strongly suggests otherwise, the posterior would balance these two sources of information.



- $\color{blue}{P(\mathbf{t} \mid \mathbf{w})}$ is the $\color{blue}{\textbf{likelihood}}$ of the target data $\mathbf{t}$ given the input data $\mathbf{X}$, parameters $\mathbf{w}$, and noise precision $\beta$. It measures how well a particular set of parameters $\mathbf{w}$ explains the observed target values $\mathbf{t}$. As mentioned above, we'll assume that each data point $(\mathbf{x}_n, t_n)$ is drawn independently from a Gaussian.


- $\color{purple}{P(\mathbf{t})}$ is the $\color{purple}{\textbf{marginal likelihood}}$, which ensures that the posterior integrates to one. It is obtained by integrating the joint probability of the data and the parameters over all possible parameter values $\mathbf{w}$. The marginal likelihood is particularly important for Bayesian model comparison because it quantifies how well a model as a whole explains the data independent of any specific parameter settings.

A notational reminder: As the data matrix $\mathbf{X}$ and the known precision parameter $\beta$ always appear in the set of conditioning variables, we have dropped the explicit $\mathbf{X}$ and $\beta$ from expressions such as the likelihood $\color{blue}{P(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta)}$ to keep the notation uncluttered.


## The Challenge of Bayesian Inference

As demonstrated above, Bayesian inference aims to determine the $\color{green}{\textbf{posterior}}$ $\color{green}{P(\mathbf{w} \mid \mathbf{t})}$ by applying Bayes' theorem. This posterior encapsulates our updated belief about the parameters $\mathbf{w}$ after observing the data $\mathbf{t}$ and $\mathbf{X}$.

However, in many real-world applications, computing the posterior directly is computationally challenging. The difficulty arises from the denominator in Bayes' formula, the $\color{purple}{\textbf{marginal likelihood}}$ $\color{purple}{P(\mathbf{t})}$, whose calculation requires integrating over the parameter space to marginalize $\mathbf{w}$ out of the product of the $\color{blue}{\textbf{likelihood}}$ $\color{blue}{P(\mathbf{t} \mid \mathbf{w})}$ and the $\color{orange}{\textbf{prior}}$ $\color{orange}{P(\mathbf{w})}$:

$$
\color{purple}{P(\mathbf{t})} = \int \color{blue}{P(\mathbf{t} \mid \mathbf{w})} \cdot \color{orange}{P(\mathbf{w})} \, d\mathbf{w}.
\tag{14}
$$

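For high-dimensional parameter spaces, this integral usually has no closed-form solution, and evaluating it numerically on a grid becomes infeasible because the number of grid points grows exponentially with the number of parameters. In such cases, we must rely on other strategies to obtain or approximate the posterior. 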
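For a single weight, the integral in (14) is still easy to evaluate on a grid, which gives some intuition for what the marginal likelihood is. The sketch below assumes a one-parameter model $t_n = w x_n + \epsilon$ with prior $w \sim \mathcal{N}(0, \alpha^{-1})$; the symbol $\alpha$, the data, and the helper `log_gauss` are assumptions introduced for this illustration. In this Gaussian case a closed form exists and serves as a check.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed 1-D model: t_n = w * x_n + noise, prior w ~ N(0, 1/alpha), known beta
alpha, beta, N = 1.0, 4.0, 10
x = rng.uniform(0, 1, N)
t = 2.0 * x + rng.normal(0.0, beta ** -0.5, N)

def log_gauss(v, mean, precision):
    """Log-density of a univariate Gaussian parameterized by its precision."""
    return 0.5 * np.log(precision / (2 * np.pi)) - 0.5 * precision * (v - mean) ** 2

# Numerical version of (14): sum likelihood * prior over a fine grid of w values
w_grid = np.linspace(-10.0, 10.0, 20001)
log_lik = log_gauss(t[None, :], w_grid[:, None] * x[None, :], beta).sum(axis=1)
log_prior = log_gauss(w_grid, 0.0, alpha)
dw = w_grid[1] - w_grid[0]
log_evidence_grid = np.log(np.sum(np.exp(log_lik + log_prior)) * dw)

# Closed form for this Gaussian case: t ~ N(0, I / beta + x x^T / alpha)
C = np.eye(N) / beta + np.outer(x, x) / alpha
_, logdet = np.linalg.slogdet(C)
log_evidence_exact = -0.5 * (N * np.log(2 * np.pi) + logdet + t @ np.linalg.solve(C, t))

print(log_evidence_grid, log_evidence_exact)
```

With one parameter, 20,001 grid points are enough. The same grid resolution for 10 parameters would need $20001^{10}$ evaluations, which is why grid integration does not scale.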


Below, we'll introduce some strategies for obtaining/approximating the posterior. 


## Strategies for Bayesian Inference

To obtain or approximate the posterior distribution, we can employ several strategies. The most common approaches include **conjugate priors**, **Markov Chain Monte Carlo (MCMC)**, and **Variational Inference (VI)**, which are briefly introduced below.



### Conjugate Priors

A conjugate prior is a type of prior distribution that makes Bayesian inference much easier. When paired with a specific likelihood, this prior ensures the posterior distribution stays in the same family as the prior (hence the name "conjugate"). This property means we can compute the posterior analytically without needing any numerical methods!

For example, in Bayesian linear regression with a Gaussian likelihood and known noise precision $\beta$, a Gaussian prior on $\mathbf{w}$ yields a posterior that is also Gaussian, so its mean and covariance can be written down in closed form.

The second notebook in this series delves into conjugacy in more detail.
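As a brief preview, here is a minimal sketch of the closed-form Gaussian posterior. It assumes a zero-mean isotropic prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \alpha^{-1}\mathbf{I})$ and known $\beta$, for which the posterior is $\mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$ with $\mathbf{S}_N^{-1} = \alpha \mathbf{I} + \beta \mathbf{X}^\top \mathbf{X}$ and $\mathbf{m}_N = \beta \mathbf{S}_N \mathbf{X}^\top \mathbf{t}$ (the result given in Bishop). The value of $\alpha$, the data, and the seed are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: prior w ~ N(0, I / alpha), known noise precision beta
alpha, beta, N = 2.0, 4.0, 20
X = np.c_[np.ones(N), rng.uniform(0, 1, size=(N, 2))]
t = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(0.0, beta ** -0.5, N)

# Gaussian posterior N(w | m_N, S_N) in closed form
S_N_inv = alpha * np.eye(X.shape[1]) + beta * X.T @ X
S_N = np.linalg.inv(S_N_inv)
m_N = beta * S_N @ X.T @ t

# For comparison: the maximum likelihood estimate from the normal equation
w_mle = np.linalg.solve(X.T @ X, X.T @ t)

print("posterior mean:", m_N)
print("posterior std :", np.sqrt(np.diag(S_N)))
print("MLE           :", w_mle)
```

The posterior mean is pulled towards the prior mean of zero relative to the maximum likelihood estimate, and the diagonal of $\mathbf{S}_N$ quantifies the remaining uncertainty in each weight.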

### Markov Chain Monte Carlo (MCMC)

Markov Chain Monte Carlo (MCMC) is a group of algorithms designed to sample from the posterior. It works by building a Markov chain where the posterior serves as the equilibrium distribution. Once enough samples are generated, they can be used to numerically approximate the posterior and derive insights about the parameters.

We'll dive deeper into how MCMC works and its practical applications in the third notebook of this series.
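To give a flavour of the idea, the sketch below implements a basic random-walk Metropolis sampler, one of the simplest MCMC algorithms, for a one-parameter model. The model ($t_n = w x_n + \epsilon$ with prior $w \sim \mathcal{N}(0, \alpha^{-1})$), the step size, the burn-in length, and the data are all assumptions chosen for illustration. The analytic posterior is used only to check the samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed 1-D model: t_n = w * x_n + noise, prior w ~ N(0, 1/alpha), known beta
alpha, beta, N = 1.0, 4.0, 20
x = rng.uniform(0, 1, N)
t = 2.0 * x + rng.normal(0.0, beta ** -0.5, N)

def log_unnormalized_posterior(w):
    """Log-likelihood plus log-prior, dropping constants that do not depend on w."""
    return -0.5 * beta * np.sum((t - w * x) ** 2) - 0.5 * alpha * w ** 2

n_samples, step = 20000, 0.5
samples = np.empty(n_samples)
w_current = 0.0
logp_current = log_unnormalized_posterior(w_current)
for i in range(n_samples):
    w_proposal = w_current + step * rng.normal()  # symmetric Gaussian proposal
    logp_proposal = log_unnormalized_posterior(w_proposal)
    # Accept with probability min(1, p(proposal) / p(current))
    if np.log(rng.uniform()) < logp_proposal - logp_current:
        w_current, logp_current = w_proposal, logp_proposal
    samples[i] = w_current

kept = samples[2000:]  # discard an initial burn-in stretch

# Analytic posterior for this conjugate case, used only as a check
post_var = 1.0 / (alpha + beta * np.sum(x ** 2))
post_mean = beta * post_var * np.sum(x * t)
print(kept.mean(), post_mean, kept.std(), np.sqrt(post_var))
```

Note that the sampler only ever evaluates the unnormalized posterior: the marginal likelihood cancels in the acceptance ratio, which is what makes MCMC applicable when that integral is intractable.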

###  Variational Inference (VI)


Variational Inference (VI) offers a clever way to approximate the $\color{green}{\textbf{posterior }P(\mathbf{w} \mid \mathbf{t})}$. Instead of using sampling, VI seeks a simpler distribution $\color{salmon}{q(\mathbf{w})}$ from a predefined family that closely approximates the posterior. This approximation is done by minimizing the **Kullback-Leibler (KL) divergence**, a measure of how much one distribution differs from another, between $\color{salmon}{q(\mathbf{w})}$ and the posterior:

$$
\color{salmon}{q^*(\mathbf{w})} \color{black}{= \arg \min_{\color{black}{q}} \text{KL}(\color{salmon}{q(\mathbf{w})} \, \color{black}{\parallel} \, \color{green}{P(\mathbf{w} \mid \mathbf{t})}}\color{black}{)}.
$$

In short, VI turns Bayesian inference into an optimization problem, which typically makes it faster and more scalable than sampling-based methods.

We'll explore this approach in more depth in the fourth notebook of this series.
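As a toy illustration of the optimization view, the sketch below fits a univariate Gaussian $q$ to a univariate Gaussian target by gradient descent on the closed-form KL divergence. The target parameters, learning rate, and iteration count are assumptions chosen for this example. It is deliberately simplified: here the target is known, whereas in practice the posterior is unknown and VI optimizes an equivalent objective that only requires the unnormalized posterior.

```python
import numpy as np

def kl_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) for univariate Gaussians q = N(mu_q, sigma_q^2), p = N(mu_p, sigma_p^2)."""
    return (np.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
            - 0.5)

# Stand-in "posterior" (assumed known here purely to keep the sketch short)
mu_p, sigma_p = 1.5, 0.3

# Variational parameters: mean and log standard deviation of q
mu_q, log_sigma_q = 0.0, 0.0
learning_rate = 0.05
for _ in range(2000):
    sigma_q = np.exp(log_sigma_q)
    grad_mu = (mu_q - mu_p) / sigma_p ** 2            # d KL / d mu_q
    grad_log_sigma = sigma_q ** 2 / sigma_p ** 2 - 1  # d KL / d log(sigma_q)
    mu_q -= learning_rate * grad_mu
    log_sigma_q -= learning_rate * grad_log_sigma

final_kl = kl_gaussians(mu_q, np.exp(log_sigma_q), mu_p, sigma_p)
print(mu_q, np.exp(log_sigma_q), final_kl)
```

Because the target lies inside the chosen family, the optimum recovers it exactly and the KL divergence goes to zero. When the true posterior is outside the family, the minimizer is only the closest member, which is the source of the approximation error.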

### Comparison of Strategies

| **Method**               | **Advantages**                                      | **Disadvantages**                                       |
|---------------------------|----------------------------------------------------|--------------------------------------------------------|
| **Conjugate Priors**      | Analytically tractable; computationally efficient  | Limited flexibility in prior/likelihood combinations   |
| **MCMC**                 | Flexible and can theoretically approximate any posterior  | Can be computationally expensive             |
| **Variational Inference** | Computationally efficient as it uses deterministic optimization | May underestimate uncertainty; <br> Result depends on the choice of approximant $q(\mathbf{w})$ |

Each method has trade-offs, and the choice of approach depends on the complexity of the model, the size of the data, and the computational resources available.
A more detailed discussion on these methods will be provided in subsequent notebooks.



# References

- [Bishop - Pattern Recognition and Machine Learning (2006)](https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf) - A comprehensive reference on machine learning theory and Bayesian methods by Christopher M. Bishop.



# Credits

Notebook creation: `meraldoantonio`

