# Introduction to Probability and its Applications

## LSST DA Data Science Fellowship Program Session 23, Pittsburgh, PA
### Bryan Scott, CIERA/Northwestern University
Based on 
- Lectures by David Hunter and Hyungsuk Tak given at the 14th Penn State School in Astrostatistics
- Probability Theory: The Logic of Science by ET Jaynes

## Overview and Pedagogical Strategy

This section will cover:

- Terminology that statisticians use to discuss probability
- Core concepts of probability theory from an intuitive (not rigorous/first principles) perspective
- Axioms of probability theory
- Origins of Bayes relationship as a result in probability theory

By the end, you should be able to:

- Derive the Bayes relationship from "first principles"
- Make use of the relationships in probability theory to solve problems involving randomness
- Assign probabilities to events

My pedagogical strategy is:

- Start from simple claims and build more complex ideas out of them in a step by step fashion. 
- Introduce and define jargon multiple times. 

This will form the basis for:

- Tomorrow's discussion of priors and likelihood functions
- The application of the Bayes relationship to the practice of Bayesian inference (throughout the week)

## Part 1: Let's define some jargon

### We define the following terms:

- The **Outcome space**, denoted $\Omega$, is the set of possible **outcomes** of some process. We can write this as

$\Omega = \{o_1, o_2, \ldots, o_n\}$

So, for example, the outcome space for a coin is {H, T}, and for a 6-sided die, {1, 2, 3, 4, 5, 6}.

- An **Event** is a subset of the outcome space (also called the sample space): $E \subseteq \Omega$.

Technical note: A **discrete sample space** is either **finite** or **countably infinite**. This definition helps with the formal mapping between outcomes in the sample space and the notion of probability. In astronomical applications, we largely ignore the technical issues that come with whether our outcome spaces are discrete or continuous ("not countably infinite"). Strictly speaking, the continuous case needs a more sophisticated notion of a **probability space**.
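As a minimal sketch, the finite outcome spaces above can be represented as Python sets, with an event being any subset. The names `coin`, `die`, `even_roll`, and `is_event` are illustrative choices, not part of the lecture.

```python
# Outcome spaces as Python sets (names are illustrative).
coin = {"H", "T"}
die = {1, 2, 3, 4, 5, 6}

# An event is any subset of the outcome space.
even_roll = {2, 4, 6}

def is_event(candidate, outcome_space):
    """Return True if `candidate` is a subset of `outcome_space`."""
    return set(candidate) <= set(outcome_space)

print(is_event(even_roll, die))   # True
print(is_event({7}, die))         # False: 7 is not a possible outcome
print(is_event(set(), coin))      # True: the empty set is also an event
```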

## An Important and Confusing Term: The "Random Variable"

A **random variable** is a **map** from the **outcome space** to the **real numbers**. There are a few notations for a random variable, for example:

- Formally, $X: \Omega \rightarrow \mathbb{R}$ defines the map from the outcome space to the real numbers (or a subset of them)

- If we want to consider a specific sort of event, we write $\{\omega \in \Omega: X(\omega) = x\}$

- The short hand for the above is $\{X = x\}$

Careful, the random variable X is the **map**, not the specific **outcome or event**.
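To make the distinction between the map and the event concrete, here is a small sketch. The particular encoding (heads to 1, tails to 0) and the helper name `event` are assumptions made for illustration.

```python
# Outcome space for a single coin flip.
omega = {"H", "T"}

def X(outcome):
    """Hypothetical random variable: map heads to 1 and tails to 0."""
    return {"H": 1, "T": 0}[outcome]

def event(random_variable, x, outcome_space):
    """The event {X = x}: every outcome that the map sends to the value x."""
    return {w for w in outcome_space if random_variable(w) == x}

print(event(X, 1, omega))  # {'H'}
print(event(X, 0, omega))  # {'T'}
print(event(X, 5, omega))  # set(): no outcome maps to 5
```

Here `X` is a function (the map), while `event(X, 1, omega)` is a set of outcomes (the event).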

## Probabilities and Random Variables

The **probability mass function** associated with a **random variable** assigns a real number, a probability, to each **event** of the form $\{X = x\}$. It is written in shorthand as $P(X = x)$ and more verbosely as

$$P(\{\omega \in \Omega: X(\omega) = x\}).$$

For the example of flipping a coin and getting tails,

$$P(\{\omega \in \{H, T\}: X(\omega) = T\}) = 1/2.$$
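A probability mass function can be sketched by counting the outcomes in each event $\{X = x\}$. The example below is hypothetical: it uses two coin flips, takes $X$ to be the number of tails, and assumes every outcome is equally probable.

```python
from fractions import Fraction
from itertools import product

# Outcome space for two flips of a coin: ('H', 'H'), ('H', 'T'), ...
omega = set(product("HT", repeat=2))

def X(outcome):
    """Hypothetical random variable: the number of tails in the outcome."""
    return outcome.count("T")

def pmf(random_variable, x, outcome_space):
    """P(X = x), assuming every outcome in the space is equally probable."""
    favorable = [w for w in outcome_space if random_variable(w) == x]
    return Fraction(len(favorable), len(outcome_space))

for x in (0, 1, 2):
    print(x, pmf(X, x, omega))
```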

## How do we assign probabilities?

Take a moment and discuss this with those around you. 

### Principle of "Insufficient Reason" or "Indifference"

This principle was first formulated by Laplace and Bernoulli. Keynes (en route to criticizing it) defined it as:

"If there is no known reason for predicating of our subject one rather than another of several alternatives, then relatively to such knowledge the assertions of each of these alternatives have an equal probability."

So I assign a uniform probability to all outcomes: an event made up of N of the M equally plausible outcomes gets probability N/M. Each single outcome therefore gets probability 1/2 when flipping a coin {H, T} and 1/6 when rolling a die {1, 2, 3, 4, 5, 6}.

We will discuss this principle in the section on Priors and Likelihoods. I will also invite you to consider a physical justification for the principle as a challenge problem.
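As a rough sketch of the principle, the helper below assigns probability 1/M to each of M outcomes and sums over the outcomes in an event. The function names `indifference` and `probability` are made up for this illustration.

```python
from fractions import Fraction

def indifference(outcome_space):
    """Assign equal probability 1/M to each of the M outcomes."""
    m = len(outcome_space)
    return {outcome: Fraction(1, m) for outcome in outcome_space}

def probability(event, assignment):
    """P(event) = N/M: add up the N outcomes that make up the event."""
    return sum(assignment[o] for o in event)

coin = indifference({"H", "T"})
die = indifference({1, 2, 3, 4, 5, 6})
print(coin["H"])                      # 1/2
print(die[3])                         # 1/6
print(probability({2, 4, 6}, die))    # 1/2
```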

## Functions of random variables

Fact: Any function of a random variable is itself a random variable. 

Problem: We often measure some variable $x$, but the result we are interested in is a function $y(x)$. What is the distribution $p(y)$? If $y = \Phi(x)$ with $\Phi$ monotonic, and hence $x = \Phi^{-1}(y)$,

$$ p(y) = p[\Phi^{-1}(y)] \left|\frac{d \Phi^{-1}(y)}{d y} \right|$$
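A quick numerical check of this formula, under assumptions chosen only for illustration: $x$ is uniform on $(0, 1)$ and $\Phi(x) = x^2$, so $\Phi^{-1}(y) = \sqrt{y}$ and $p(y) = 1/(2\sqrt{y})$. The sketch compares the predicted probability in a narrow bin with the fraction of simulated samples that land there.

```python
import math
import random

def p_x(x):
    """Density of x: uniform on (0, 1) (an assumed example distribution)."""
    return 1.0 if 0.0 < x < 1.0 else 0.0

def p_y(y):
    """Density of y = x**2 from the change-of-variables formula."""
    if not 0.0 < y < 1.0:
        return 0.0
    x = math.sqrt(y)                         # Phi^{-1}(y)
    jacobian = 1.0 / (2.0 * math.sqrt(y))    # |d Phi^{-1}(y) / dy|
    return p_x(x) * jacobian

# Monte Carlo check: fraction of samples in a narrow bin vs p_y * bin width.
rng = random.Random(0)
samples = [rng.random() ** 2 for _ in range(200_000)]
lo, hi = 0.24, 0.26
empirical = sum(lo <= y < hi for y in samples) / len(samples)
predicted = p_y(0.25) * (hi - lo)
print(empirical, predicted)
```

The two numbers should agree to within the Monte Carlo noise and the finite bin width.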

Some important remarks: 

* Cumulative statistics are invariant under monotonic transformations (they map to the same data point) - this provides the basis for a number of statistical tests that compare distributions. 

* The standard uncertainty propagation formulas are derived by a Taylor expansion of the transformation $\Phi(x)$ to first order, 

$$ \sigma_y = \left| \frac{d \Phi(x)}{dx} \right| \sigma_x $$

Careful: these formulas only work if it is sufficient to keep only the first order terms in the transformation.
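The first-order formula can be compared against a simulation. This is a sketch under assumed values: $\Phi(x) = \ln x$, a measurement $x_0 = 10$, and a Gaussian uncertainty $\sigma_x = 0.1$, small enough that the first-order term should dominate.

```python
import math
import random
import statistics

def propagate(dphi_dx, x0, sigma_x):
    """First-order uncertainty on y = Phi(x), evaluated at x = x0."""
    return abs(dphi_dx(x0)) * sigma_x

x0, sigma_x = 10.0, 0.1                                  # assumed measurement
sigma_y = propagate(lambda x: 1.0 / x, x0, sigma_x)      # Phi(x) = ln(x)

# Monte Carlo estimate: push Gaussian draws of x through Phi and measure the spread.
rng = random.Random(1)
ys = [math.log(rng.gauss(x0, sigma_x)) for _ in range(100_000)]
sigma_y_mc = statistics.stdev(ys)
print(sigma_y, sigma_y_mc)
```

For much larger $\sigma_x / x_0$ the two estimates would be expected to drift apart, which is the regime the warning above refers to.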

## Part 2: Kolmogorov Axioms of Probability Theory

There are three basic axioms that restrict the form that P can take:

$$P(\omega) \ge 0 \space \forall \space \omega \in \Omega$$ (probabilities are never negative)

$$\sum_i P(\omega_i) = 1$$ 

if the $\omega_i$ span the entire outcome space. (probabilities must sum to 1)

$$P\left(\bigcup_{i=1}^\infty \omega_i\right) = \sum_{i=1}^\infty P(\omega_i)$$ 

for disjoint $\omega_i$ (countable additivity or "the sum rule")
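For a finite outcome space, the three axioms can be checked directly. The sketch below uses a fair die as an assumed example, and the helper names `satisfies_axioms` and `prob` are illustrative.

```python
from fractions import Fraction

def satisfies_axioms(p):
    """Check non-negativity and normalization for a dict {outcome: probability}."""
    non_negative = all(value >= 0 for value in p.values())
    normalized = sum(p.values()) == 1
    return non_negative and normalized

def prob(event, p):
    """Sum rule: an event is a disjoint union of outcomes, so add them up."""
    return sum(p[o] for o in event)

die = {face: Fraction(1, 6) for face in range(1, 7)}
low, high = {1, 2}, {5, 6}            # two disjoint events

print(satisfies_axioms(die))                                          # True
print(prob(low | high, die) == prob(low, die) + prob(high, die))      # True
print(satisfies_axioms({"H": Fraction(3, 2), "T": Fraction(-1, 2)}))  # False
```

The last assignment sums to 1 but fails the first axiom because one value is negative.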

## Sum Rule Venn Diagram

## Part 3: A Quick Proof of Bayes' Rule: Conditional Probabilities

Suppose you have some event, which we will call $A$. We define the probability of event $A$ occurring as:

$$P(A).$$

Now suppose we want to know the probability that both event $A$ and event $B$ occur: $P(A \cap B)$. At first glance, it seems like this ought to be the product of the probability of $A$ and the probability of $B$:

$$P(A \cap B) = P(A)\,P(B).$$

This is the product rule if $A$ and $B$ are *independent*. 

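To see why this is true, imagine two flips of a single coin. If event $A$ is the first flip landing heads, and event $B$ is the second flip landing tails, then $P(A \cap B) = P(A)\,P(B) = 1/4$: exactly one of the four equally likely sequences (HH, HT, TH, TT) has heads followed by tails.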
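One way to check the product rule for independent events is to enumerate the outcomes. The sketch assumes two flips of a fair coin, with $A$ the first flip landing heads and $B$ the second flip landing tails.

```python
from fractions import Fraction
from itertools import product

omega = list(product("HT", repeat=2))   # (first flip, second flip)

def prob(event):
    """Probability of an event, assuming equally likely outcomes."""
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] == "H"}   # first flip lands heads
B = {w for w in omega if w[1] == "T"}   # second flip lands tails

print(prob(A & B), prob(A) * prob(B))   # 1/4 1/4
```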

What if the probability of $A$ depends on whether $B$ occurs?

In that case, the probability of $A$ *and* $B$ requires a statement about conditional probability:

$$P(A \cap B) = P(A\mid{B})\,P(B),$$

which should be read as "the probability of $A$ and $B$ is equal to the probability of $A$ given $B$ multiplied by the probability of $B$."
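A small example where the two events are not independent may help. The events below (one roll of a fair die, $A$ = "even", $B$ = "greater than 3") are hypothetical choices for illustration.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}    # one roll of a fair die

def prob(event):
    """Probability of an event, assuming equally likely outcomes."""
    return Fraction(len(event), len(omega))

def cond(a, b):
    """P(A | B): restrict attention to outcomes in B, then count those in A."""
    return Fraction(len(a & b), len(b))

A = {2, 4, 6}                 # the roll is even
B = {4, 5, 6}                 # the roll is greater than 3

print(prob(A & B))            # 1/3
print(prob(A) * prob(B))      # 1/4: the naive product is wrong here
print(cond(A, B) * prob(B))   # 1/3: the conditional form is right
```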

## Product Rule Venn Diagram

The probability of $A$ and $B$ must be equal to the probability of $B$ and $A$, which leads to:

$$P(A\mid{B})\,P(B) = P(B\mid{A})\,P(A),$$

which we can rearrange as:

$$P(A\mid{B}) = \frac{P(B\mid{A})\,P(A)}{P(B)}.$$ (This is Bayes' rule.)
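The rearrangement can be verified by brute force on a small example. The sketch assumes one roll of a fair die with two hypothetical events, and compares $P(A \mid B)$ counted directly with the value obtained from the right-hand side.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}          # one roll of a fair die
A = {2, 4, 6}                       # hypothetical event: the roll is even
B = {3, 4, 5, 6}                    # hypothetical event: the roll is at least 3

def prob(event):
    """Probability of an event, assuming equally likely outcomes."""
    return Fraction(len(event), len(omega))

def cond(a, b):
    """P(A | B) by counting outcomes inside B."""
    return Fraction(len(a & b), len(b))

def bayes(p_b_given_a, p_a, p_b):
    """Bayes' rule: P(A | B) = P(B | A) P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

direct = cond(A, B)                               # count outcomes directly
via_bayes = bayes(cond(B, A), prob(A), prob(B))   # use the rearranged product rule
print(direct, via_bayes)                          # 1/2 1/2
```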

### Conditionalization and the Law of Total Probability

First we define the concept of a **partition** of a set. A partition is a set of disjoint sets whose union is the outcome space, $\Omega$.

Then, the law of total probability says that the probability of an event, A, can be found by summing over all of the ways A and events in the partition of $\Omega$ can occur, mathematically,

$$ P(A) = \sum_{i=1}^N P(A \cap B_i) $$

The definition of $P(A \cap B)$ allows us to write,

$$ P(A) = \sum_{i=1}^N P(A|B_i)P(B_i) $$

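Example of the Law of Total Probability: suppose a coin is a trick coin that always lands heads with probability 1/6, and is a fair coin otherwise. The probability of getting heads on all four of four flips is

$$ P(\text{H on four flips}) = P(\text{H on four flips}|\text{not trick}) \times P(\text{not trick}) + P(\text{H on four flips}|\text{trick}) \times P(\text{trick}) $$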

$$ P(\text{H on four flips}) = \left(\frac{1}{2}\right)^4 \times \frac{5}{6} + 1 \times \frac{1}{6} = \frac{7}{32}$$
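The arithmetic above can be reproduced with exact fractions. The function name `total_probability` and the dictionary layout are illustrative; the numbers are the ones in the example.

```python
from fractions import Fraction

def total_probability(likelihoods, priors):
    """P(A) = sum_i P(A | B_i) P(B_i) over a partition {B_i}."""
    assert sum(priors.values()) == 1, "partition probabilities must sum to 1"
    return sum(likelihoods[b] * priors[b] for b in priors)

priors = {"trick": Fraction(1, 6), "not trick": Fraction(5, 6)}
# P(H on four flips | B_i): certain for the trick coin, (1/2)^4 for a fair coin.
likelihoods = {"trick": Fraction(1), "not trick": Fraction(1, 2) ** 4}

p_four_heads = total_probability(likelihoods, priors)
print(p_four_heads)   # 7/32
```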

## Aside: Alternative Axioms:

What if instead of requiring that all probabilities: 

$$P(\omega) \ge 0 \space \forall \space \omega \in \Omega$$

We instead have the requirement that:  

$$P(\omega) \in \mathbb{R}  \space \forall \space \omega \in \Omega$$

and we retain one of the two remaining Kolmogorov axioms:

$$P\left(\bigcup_{i=1}^\infty \omega_i\right) = \sum_{i=1}^\infty P(\omega_i)$$ 

Take two minutes to discuss with your neighbor why you might prefer the Kolmogorov axioms to my alternative proposal.

We can interpret a negative probability the same way we interpret negative numbers. We never question that we can use the number "-5", for example -5 apples, as a calculational shorthand. We merely require that the final result of any calculation which concerns real physical apples is a positive number of (or perhaps 0) apples. 

Negative probabilities would work similarly. We would require:

$$ P(A) = \sum_{i=1}^N P(A|B_i)P(B_i) \ge 0 $$

<img src= "negative_prob_table.png" alt="drawing" width="700"/>

If we know that condition A happens 70% of the time and condition B 30% of the time, then

$P(1) = 0.7 \times 0.3 + 0.3 \times (-0.4) = 0.09$

$P(2) = 0.7 \times 0.6 + 0.3 \times 1.2 = 0.78$

$P(3) = 0.7 \times 0.1 + 0.3 \times 0.2 = 0.13$

and we therefore have a well defined final result: every final probability is non-negative, and $P(1) + P(2) + P(3) = 1$, as our second axiom requires for these to be probabilities under the alternative axioms. 
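The bookkeeping can be checked with exact fractions. The conditional values below are the ones that appear in the arithmetic above; the variable names are illustrative.

```python
from fractions import Fraction as F

# Conditional "probabilities" of states 1, 2, 3, as used in the arithmetic above.
given_A = {1: F(3, 10), 2: F(6, 10), 3: F(1, 10)}
given_B = {1: F(-4, 10), 2: F(12, 10), 3: F(2, 10)}   # includes a negative entry
p_A, p_B = F(7, 10), F(3, 10)

# Law of total probability applied state by state.
final = {s: given_A[s] * p_A + given_B[s] * p_B for s in (1, 2, 3)}

for state, p in final.items():
    print(state, float(p))

print(all(p >= 0 for p in final.values()))   # True: no negative final result
print(sum(final.values()) == 1)              # True: the final values sum to 1
```

Only the intermediate, conditional quantities go negative or exceed 1; the observable totals do not.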

(Note that A and B here do not have the same meaning as in the sum on the previous slide)

Disturbed? So am I. This example comes from a famous paper by [Richard Feynman](https://cds.cern.ch/record/154856/files/pre-27827.pdf/). It is worth a read.

## Part 4: Relationship Between Probability and Inference

As scientists, do we care about the probability of the data, $P(\text{H on four flips})$? I would argue we are much more interested in the idea of **explanation**, which is what Bayes' rule now allows us to attempt. We will **condition** our explanation on the **data** as follows:

$$ P(\text{trick} | \text{H on four flips}) = \frac{P(\text{H on four flips}| \text{trick})P(\text{trick})}{P(\text{H on four flips})} = \frac{16}{21}$$

which captures our intuition that, if we think the coin is rigged and the flips don't go our way, we're probably being cheated! 
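The posterior can be computed for any number of consecutive heads, which shows how the data move us away from the prior. This is a sketch: the function name `posterior_trick` is made up, and the default prior of 1/6 is the value used in the example.

```python
from fractions import Fraction

def posterior_trick(n_heads, p_trick=Fraction(1, 6)):
    """P(trick | n_heads heads in n_heads flips) for a coin that is either
    fair or an always-heads trick coin."""
    like_trick = Fraction(1)                    # trick coin always lands heads
    like_fair = Fraction(1, 2) ** n_heads       # fair coin: (1/2)^n
    evidence = like_trick * p_trick + like_fair * (1 - p_trick)
    return like_trick * p_trick / evidence

print(posterior_trick(0))   # 1/6: no data, so the posterior equals the prior
print(posterior_trick(4))   # 16/21
```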

This example illustrates the relationship between probability and inference. 

Roughly: probability explains how likely various outcomes (observations) are, given the model parameter $\theta$, while inference quantifies the uncertainty about $\theta$, given observed data $x$.

In classical statistics, we think of the problems we encounter in the following way: 

There exists a **population** from which we **sample** (select subsets of). We describe the sample with sets of descriptive **statistics**, for example, the sample **mean**, the sample **variance**, the sample **skewness**, the sample **kurtosis**, etc.

We then use the **sample statistics** to do inference, that is, to estimate, or infer, the parameters of the unobserved **population probability distribution**. 
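As a sketch of this workflow, the code below draws a sample from an assumed Gaussian population (mean 5, standard deviation 2, chosen only for illustration) and computes the sample statistics named above. The kurtosis is reported as excess kurtosis, which is one common convention.

```python
import random
import statistics

def sample_statistics(sample):
    """Return the sample mean, variance, skewness, and excess kurtosis."""
    n = len(sample)
    mean = statistics.fmean(sample)
    variance = statistics.variance(sample, mean)     # unbiased (n - 1) estimator
    # Central moments (dividing by n) for the shape statistics.
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    m4 = sum((x - mean) ** 4 for x in sample) / n
    return {
        "mean": mean,
        "variance": variance,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3.0,              # excess: 0 for a Gaussian
    }

# The "unobserved" population: in practice we would not know these values.
population_mu, population_sigma = 5.0, 2.0
rng = random.Random(42)
sample = [rng.gauss(population_mu, population_sigma) for _ in range(50_000)]

stats = sample_statistics(sample)
for name, value in stats.items():
    print(name, round(value, 3))
```

The sample mean and variance serve as estimates of the population parameters, here 5 and 4.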

## Part 5: What is Probability?

If you review this lecture, you'll notice something disturbing. I haven't defined precisely what probability is. This is because there is, in fact, no consensus interpretation beyond the notion of maps and the Kolmogorov axioms. 

The basic debate comes down to the status of the parameters $\theta$. There are two perspectives: 

* The $\theta$ are fixed parameters to be estimated from (possibly) many repeated samples of the population. The sampling or the realization of the random process is the source of randomness. Our **estimates** have an associated probability distribution. Probability is thought of in terms of the long run frequency of events. 

* The data are fixed - produced by some underlying physical process. $\theta$ is some random variable with associated probability distributions $p(\theta)$ and $p(\theta | D)$. Probability is a measure of our uncertainty about $\theta$. 

The former interpretation is often called the classical or frequentist interpretation, owing to its focus on the notion of repeated sampling. The latter is called the Bayesian interpretation, after the Rev. Thomas Bayes, who first argued for it in a posthumously published essay. 
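The two perspectives can be contrasted on a coin with bias $\theta$. This is only a sketch under stated assumptions: the first function treats $\theta$ as fixed and simulates repeated flips, while the second treats the data (7 heads in 10 flips, a made-up data set) as fixed and evaluates $p(\theta \mid D)$ on a grid with a flat prior.

```python
import random

def long_run_frequency(theta, n_flips, seed=0):
    """Frequentist view: theta is fixed and the data are random."""
    rng = random.Random(seed)
    heads = sum(rng.random() < theta for _ in range(n_flips))
    return heads / n_flips

def grid_posterior(heads, n_flips, n_grid=1001):
    """Bayesian view: the data are fixed; p(theta | D) on a grid, flat prior assumed."""
    thetas = [i / (n_grid - 1) for i in range(n_grid)]
    unnorm = [t ** heads * (1 - t) ** (n_flips - heads) for t in thetas]
    total = sum(unnorm)
    return thetas, [u / total for u in unnorm]

# Long-run frequency of heads approaches the fixed theta.
freq = long_run_frequency(0.5, 100_000)
print(freq)

# Posterior over theta for one fixed data set.
thetas, post = grid_posterior(heads=7, n_flips=10)
post_mean = sum(t * p for t, p in zip(thetas, post))
print(post_mean)   # close to (7 + 1) / (10 + 2) = 2/3 for a flat prior
```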

Although you are (almost always) free to work within either interpretation, the dominant view in contemporary astronomy is a Bayesian one. 