# Overview

# Prerequisites


## Bayes' Theorem

Bayes' theorem is an equality that relates two jointly distributed variables through their conditional and marginal probabilities.

$$ P(A|B) = \frac{P(B|A)P(A)}{P(B)} $$

Because the definition of conditional probability gives $P(B|A)P(A) = P(B \cap A)$, the theorem can also be stated as:

$$ P(A|B) = \frac{P(B \cap A)}{P(B)} $$


The following terminology has been applied to the terms above:

- $P(A|B)$: Posterior probability
- $P(A)$: Prior probability
- $P(B|A)$: Likelihood
- $P(B)$: Evidence

This allows Bayes' theorem to be restated as:

$$ \text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}} $$

### Example: Smoke and Fire
Let A be the event that a fire exists. Let B be the event that smoke exists.

We multiply the probability that fire creates smoke, $P(B|A)$, by the probability of fire, $P(A)$. This gives the probability that a fire exists and is producing smoke, $P(B \cap A)$. We then divide by the probability that smoke exists, $P(B)$.

This division rescales the result to the cases where smoke is present and, as the equality states, gives the probability that a fire exists given that smoke exists.
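To make the example concrete, here is a minimal sketch that plugs numbers into the formula. The three probabilities are made up purely for illustration and do not come from any data; the helper name `posterior` is also just a label chosen for this sketch.

```python
# Hypothetical probabilities, chosen only for illustration.
p_fire = 0.01              # P(A): prior probability that a fire exists
p_smoke_given_fire = 0.90  # P(B|A): likelihood of smoke when there is a fire
p_smoke = 0.10             # P(B): evidence, the overall probability of smoke


def posterior(likelihood: float, prior: float, evidence: float) -> float:
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence


p_fire_given_smoke = posterior(p_smoke_given_fire, p_fire, p_smoke)
print(f"P(fire | smoke) = {p_fire_given_smoke:.2f}")
```

With these assumed numbers the posterior is 0.09: seeing smoke raises the probability of fire from 1% to 9%, but fire is still far from certain because smoke is much more common than fire.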



## Bayesian Statistics



### The Prior and Posterior Distributions
A prior distribution of a parameter is the probability distribution that represents your uncertainty about the parameter before the current data are examined. Multiplying the prior distribution and the likelihood function together leads to the posterior distribution of the parameter. You use the posterior distribution to carry out all inferences. You cannot carry out any Bayesian inference or perform any modeling without using a prior distribution.
https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_introbayes_sect004.htm
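As a rough illustration of "prior times likelihood gives the posterior", the sketch below estimates the bias of a coin on a grid of candidate values. The uniform prior, the grid resolution, and the data (7 heads in 10 flips) are all assumptions chosen for this example.

```python
from math import comb

# Grid of candidate values for the parameter theta (probability of heads).
thetas = [i / 100 for i in range(101)]

# Prior: uniform over the grid (an assumption; any prior that sums to 1 works).
prior = [1 / len(thetas)] * len(thetas)

# Hypothetical data: 7 heads in 10 flips, scored with a binomial likelihood.
heads, flips = 7, 10
likelihood = [
    comb(flips, heads) * t**heads * (1 - t) ** (flips - heads) for t in thetas
]

# The posterior is proportional to prior * likelihood.
# Dividing by the sum (the evidence) makes it a proper distribution.
unnormalized = [p * l for p, l in zip(prior, likelihood)]
evidence = sum(unnormalized)
posterior = [u / evidence for u in unnormalized]

# The most probable grid value under the posterior.
map_theta = thetas[posterior.index(max(posterior))]
print(f"Most probable theta on the grid: {map_theta}")
```

Swapping in a different `prior` list (for example, one that favors values near 0.5) shifts the posterior accordingly, which is the sense in which the prior encodes what is believed before the data are examined.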

The prior captures our ignorance regarding the true generating function (the objective function).

Constraints on the prior lead to a smooth distribution: similar inputs are expected to have similar outputs.
https://ekamperi.github.io/mathematics/2021/03/30/gaussian-process-regression.html

Show plots with uncertainty bands
https://ekamperi.github.io/mathematics/2021/03/30/gaussian-process-regression.html

## What's the connection with quantifying uncertainty, likelihood, and hypothesis testing?

# History

# Definition

Bayesian optimization uses a surrogate model to stand in for the objective function.

A [surrogate model](https://en.wikipedia.org/wiki/Surrogate_model) is used to estimate a function/process of interest. The surrogate model is used when the actual outcomes cannot be directly observed.

A Gaussian process is often chosen as the surrogate model because it provides a measure of uncertainty for its estimates. This uncertainty is used to balance exploration and exploitation while the search space is searched for the optimal parameter set. That is, we want to explore areas of the search space we have not yet inspected while exploiting (continuing to search) the areas where we are more certain we will observe good values.

https://brendanhasz.github.io/2019/03/28/hyperparameter-optimization.html
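The sketch below shows what "a measure of uncertainty" looks like for a Gaussian process surrogate. It is a minimal NumPy version that assumes a zero-mean prior, a squared-exponential (RBF) kernel with unit length scale, and three made-up observations of `sin(x)`; the function names `rbf_kernel` and `gp_posterior` are illustrative and not taken from any library.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two 1-D arrays of points."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * sq_dist / length_scale**2)


def gp_posterior(x_train, y_train, x_query, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_query."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_tq = rbf_kernel(x_train, x_query)
    solved = np.linalg.solve(k_tt, k_tq)  # K_tt^{-1} K_tq
    mean = solved.T @ y_train
    # Prior variance is 1 for this kernel; subtract what the data explain.
    var = 1.0 - np.sum(k_tq * solved, axis=0)
    return mean, np.sqrt(np.clip(var, 0.0, None))


# Three hypothetical observations of an objective (sin, for illustration).
x_train = np.array([0.0, 2.0, 4.0])
y_train = np.sin(x_train)

# Query an observed point, a point between observations, and a distant point.
x_query = np.array([2.0, 3.0, 8.0])
mean, std = gp_posterior(x_train, y_train, x_query)
for x, m, s in zip(x_query, mean, std):
    print(f"x={x:.1f}  mean={m:+.3f}  std={s:.3f}")
```

The standard deviation is close to zero at the observed point, larger between observations, and largest far from any data. That growth in uncertainty away from the data is the signal used to decide where exploring is worthwhile.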

## Terminology
- **objective function** - the thing (function) being optimized
- **search space** - the possible inputs for the objective function
- **surrogate model** - the model representing the objective function (see use cases)
- **acquisition function** - the method for selecting the next segment of the domain to explore (BO is an iterative process)
- **exploitation** - trying solutions that are similar to ones already shown to be good
- **exploration** - trying solutions that are in unknown areas of the search space

good article: https://ekamperi.github.io/machine%20learning/2021/06/11/acquisition-functions.html

## The Basic Algorithm

Bayesian Optimization is an iterative process which is typically terminated by some threshold and/or exhaustion conditions.

Remake this image to talk about choosing a surrogate model and considering termination conditions

<center><img src='images/bayesian_optimization_algorithm_diagram.png' width='400px' height='400px'></center>

picture here: https://ekamperi.github.io/machine%20learning/2021/06/11/acquisition-functions.html

Bayesian optimization gets its name because it uses Bayesian statistics and Bayes' theorem. The term is generally attributed to Jonas Mockus, who coined it in a series of publications on global optimization in the 1970s and 1980s.

$$ P(A|B) = \frac{P(B|A)P(A)}{P(B)} $$

Rearranging this equation slightly, we can see that the conditional probability is the product of two terms:

$$ P(A|B) = P(B|A)P(A) \times \frac{1}{P(B)} $$

The second term, $1/P(B)$, does not depend on $A$ and acts only as a normalizing constant. Dropping it turns the equality into a proportionality:

$$ P(A|B) \propto P(B|A)P(A) $$

It is common to refer to these terms as the posterior probability, likelihood, and prior probability respectively.

Definition of priors: https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_introbayes_sect004.htm


## Acquisition functions

- Upper Confidence Bound (UCB)
- Probability of Improvement (PI)
- Expected Improvement (EI)

https://ekamperi.github.io/machine%20learning/2021/06/11/acquisition-functions.html
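The three functions listed above can be sketched in a few lines, given the surrogate's predicted mean and standard deviation at a candidate point. This version assumes a maximization problem and Gaussian predictions. The exploration parameters `kappa` and `xi` and their default values are common choices, but here they are assumptions rather than anything prescribed by the references.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def upper_confidence_bound(mean, std, kappa=2.0):
    """UCB: favor points with a high predicted mean or high uncertainty."""
    return mean + kappa * std


def probability_of_improvement(mean, std, best, xi=0.01):
    """PI: probability that the point beats the best value seen so far."""
    improvement = mean - best - xi
    if std == 0.0:
        return 1.0 if improvement > 0 else 0.0
    return _STD_NORMAL.cdf(improvement / std)


def expected_improvement(mean, std, best, xi=0.01):
    """EI: expected amount by which the point beats the best value so far."""
    improvement = mean - best - xi
    if std == 0.0:
        return max(improvement, 0.0)
    z = improvement / std
    return improvement * _STD_NORMAL.cdf(z) + std * _STD_NORMAL.pdf(z)


# Two hypothetical candidates, with the best observed value so far at 0.9.
best_so_far = 0.9
safe = {"mean": 1.0, "std": 0.05}      # slightly better, well understood
uncertain = {"mean": 0.6, "std": 0.8}  # worse on average, poorly understood

for name, c in [("safe", safe), ("uncertain", uncertain)]:
    ucb = upper_confidence_bound(c["mean"], c["std"])
    pi = probability_of_improvement(c["mean"], c["std"], best_so_far)
    ei = expected_improvement(c["mean"], c["std"], best_so_far)
    print(f"{name:9s}  UCB={ucb:.3f}  PI={pi:.3f}  EI={ei:.3f}")
```

With these made-up candidates, PI prefers the safe point (exploitation), while UCB and EI prefer the uncertain one (exploration). The choice of acquisition function, and of `kappa` or `xi`, is one way the exploration/exploitation balance gets tuned.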

More methods and great visual for comparison

https://distill.pub/2020/bayesian-optimization/

## Standard vs Exotic Bayesian Optimization Problems

Standard Bayesian optimization relies upon each $x \in A$ being easy to evaluate, and problems that deviate from this assumption are known as exotic Bayesian optimization problems. Optimization problems can become exotic if it is known that there is noise, the evaluations are being done in parallel, the quality of evaluations relies upon a tradeoff between difficulty and accuracy, the presence of random environmental conditions, or if the evaluation involves derivatives.

https://en.wikipedia.org/wiki/Bayesian_optimization

# Example acquisition functions
- Probability of improvement
- Expected improvement
- Bayesian expected losses
- Upper confidence bounds (UCB)
- Thompson sampling and hybrids of these

https://en.wikipedia.org/wiki/Bayesian_optimization


# Compare brute force vs Bayesian ... cool visuals
https://ekamperi.github.io/machine%20learning/2021/05/08/bayesian-optimization.html

Active learning, visuals of the prior being updated
https://distill.pub/2020/bayesian-optimization/

## constraints
https://distill.pub/2020/bayesian-optimization/

# Use Cases
- no analytic expression is available
- functions without 1st or 2nd order statistics (because other methods cannot be used)
- black box functions (objective function is unknown)
- functions that are difficult to observe
- functions that are expensive to evaluate (lab experiments)
- Lipschitz-continuous ?? https://www.cs.uic.edu/~hjin/files/bayesian_opt.pdf
- no guarantee of convexity -> rules out methods from convex optimization

# Examples

## Gaussian Process

### Choose a surrogate model
The Gaussian process is a classic choice, but others exist.

### Define "the prior" (distribution)

### Obtain the posterior (distribution)
Given the set of observations (function evaluations), use Bayes' rule to obtain the posterior.

### Determine the next search parameters using the acquisition function
Use an acquisition function $\alpha(x)$, which is a function of the posterior, to decide the next sample point: $x_t = \text{argmax}_x \alpha(x)$

### Check Termination Conditions

### Go To Step 2
Add the newly sampled data to the set of observations and go to step 2 until convergence or until the budget is exhausted.
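The steps above can be put together in a short loop. This is a minimal sketch under several assumptions that are not from the text: a made-up one-dimensional `objective`, a search space discretized into a grid on [0, 5], a zero-mean Gaussian process with a unit-length-scale RBF kernel as the surrogate, UCB with `kappa=2.0` as the acquisition function, and a budget of 15 evaluations. It is meant to show the shape of the loop, not a production implementation.

```python
import numpy as np


def objective(x):
    """Hypothetical 'expensive' black-box function to maximize (peak at x=2)."""
    return np.exp(-((x - 2.0) ** 2))


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two 1-D arrays of points."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale**2)


def gp_posterior(x_obs, y_obs, x_grid, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP on x_grid."""
    k_oo = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_og = rbf_kernel(x_obs, x_grid)
    solved = np.linalg.solve(k_oo, k_og)
    mean = solved.T @ y_obs
    var = 1.0 - np.sum(k_og * solved, axis=0)
    return mean, np.sqrt(np.clip(var, 0.0, None))


def ucb(mean, std, kappa=2.0):
    """Upper confidence bound acquisition function."""
    return mean + kappa * std


# Surrogate and prior: a zero-mean GP with an RBF kernel (see gp_posterior).
x_grid = np.linspace(0.0, 5.0, 201)  # discretized search space
x_obs = np.array([0.5, 4.5])         # two initial evaluations
y_obs = objective(x_obs)

for step in range(15):  # budget: at most 15 further evaluations
    # Obtain the posterior from all observations so far.
    mean, std = gp_posterior(x_obs, y_obs, x_grid)
    # Pick the next point by maximizing the acquisition function.
    x_next = x_grid[np.argmax(ucb(mean, std))]
    # Termination check: stop if the proposal repeats an evaluated point.
    if np.any(np.isclose(x_obs, x_next)):
        break
    # Evaluate the objective and add the result to the observations.
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, objective(x_next))

best_x = x_obs[np.argmax(y_obs)]
print(f"best x = {best_x:.3f}, f(best x) = {y_obs.max():.3f}, "
      f"evaluations = {len(x_obs)}")
```

Restricting the search to a grid keeps the `argmax` step trivial; in a continuous search space the acquisition function would itself be maximized with a numerical optimizer. The repeat-point check is only one possible termination condition, and a threshold on the acquisition value or on the improvement between iterations would fit the same slot.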