# Axioms of Probability

There are different versions of the "axioms of probability" but we'll use the **Kolmogorov Axioms of Probability** because they're straightforward. They define the basic properties of a **probability measure**.

Probability is defined over outcomes (we'll talk about events later). Outcomes are, more or less, mutually exclusive observations about a fixed domain, which can generally include the *properties* of objects. We've already discussed the coin flip, `W = {"heads", "tails"}`. If we were talking about customers, we might have an outcome space of `W = {"purchase", "not purchase"}`. If a customer purchases something, we can talk about an outcome space for where they live, `S = {"AL", "AK", "AR", ..., "WY"}`, and one for their gender, `G = {"male", "female"}` (very few businesses account for non-binary identification). The customer's state of residence and gender are properties of the person. Finally, we can talk about joint outcome spaces as the Cartesian product of simple outcome spaces, `S x G = {("AL", "male"), ("AL", "female"), ("AK", "male"), ("AK", "female"),...}`.
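As a rough illustration of the joint outcome space, here is a minimal Python sketch. The lists are hypothetical, truncated stand-ins for the state and gender outcome spaces (the real state list is much longer), and `joint_space` is a name introduced only for this example.

```python
from itertools import product

# Hypothetical, truncated outcome spaces for illustration only.
S = ["AL", "AK", "AR"]
G = ["male", "female"]

# The joint outcome space S x G is every (state, gender) pair.
joint_space = list(product(S, G))

print(joint_space[:4])
# [('AL', 'male'), ('AL', 'female'), ('AK', 'male'), ('AK', 'female')]
print(len(joint_space))  # 3 states * 2 genders = 6 joint outcomes
```

The size of the joint space is the product of the sizes of the simple spaces, which is why joint tables grow quickly as properties are added.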

We will define different types of probability later but for now we take `P()` to mean "the probability of". The argument to `P(W)`, `W`, is an outcome space, such as state or gender. `P(W)` can be thought of as an assignment of probability to each element of `W` (as a table of probability values or perhaps a function). `P(W="male")` returns the probability of the single outcome `W="male"` in `W`. Following the conventional notation, when confusion is unlikely, we can also write `P(male)` to mean the same as `P(W="male")`.

If we let $W$ be the set of all possible outcomes and $w$ be some particular outcome (outcomes are considered to be *atomic*, that is, mutually exclusive) then the axioms of probability are:

1. $P(w) \geq 0$
2. $P(W) = \sum_i P(w_i) = 1$
3. $P(w_i \cup w_j) = P(w_i) + P(w_j)$ for $i \neq j$

The first axiom states that a probability, a degree of certainty, must be non-negative. While one can certainly think of an interpretation of a negative probability (perhaps our certainty against an event happening), it is cleaner to think of all probabilities as being non-negative.

The second axiom says that we must be 100% certain in all the outcomes in $W$ taken together: exactly one of them must happen. Together with the first axiom, it also constrains probabilities to the range $[0, 1]$. We can always convert an "improper" probability distribution (non-negative values that do not sum to 1) into a proper probability distribution by *normalizing*: dividing each "improper" probability by the sum total.

In fact, we may often do this on purpose when eliciting probabilities from people. If we assign a certainty value of 1 to an outcome A, we can ask someone if they feel that outcome B is twice as likely to happen (in which case it would get a value of 2), or half as likely to happen (in which case it would get a 1/2). We can then go through and normalize.
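A minimal sketch of that normalization step follows. The outcomes `"A"`, `"B"`, and `"C"` and their relative weights are hypothetical (B judged twice as likely as A, C half as likely), and `normalize` is a helper name invented for this example. Exact fractions are used so the arithmetic is easy to check by hand.

```python
from fractions import Fraction

def normalize(weights):
    """Divide each non-negative weight by the total so the values sum to 1."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return {outcome: w / total for outcome, w in weights.items()}

# Hypothetical elicited values: A is the reference outcome with weight 1.
elicited = {"A": Fraction(1), "B": Fraction(2), "C": Fraction(1, 2)}

proper = normalize(elicited)
for outcome, p in proper.items():
    print(outcome, p)
# A 2/7
# B 4/7
# C 1/7
```

The relative sizes are preserved (B is still twice as likely as A), but the values now sum to 1.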

The third axiom says that the probability of $w_i$ *or* $w_j$ is equal to the sum of their individual probabilities, $P(w_i)$ and $P(w_j)$. If our degree of belief in rain tomorrow is 0.23 and our degree of belief in snow tomorrow is 0.10 then our degree of belief (probability) in either rain or snow tomorrow must be 0.33. This axiom is known as the **Additive Law of Probability**. Note that this is only true because we're talking about atomic outcomes.
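The three axioms can be checked mechanically on a small table. This is only a sketch: `satisfies_axioms` and `prob_either` are hypothetical helper names, and the `"neither"` outcome is an assumption added so that the rain and snow values above belong to a distribution that sums to 1.

```python
from fractions import Fraction

def satisfies_axioms(p):
    """Axioms 1 and 2 for a dict mapping each atomic outcome to its probability."""
    non_negative = all(v >= 0 for v in p.values())
    sums_to_one = sum(p.values()) == 1
    return non_negative and sums_to_one

def prob_either(p, w_i, w_j):
    """Axiom 3 (additive law) for two distinct atomic outcomes."""
    if w_i == w_j:
        raise ValueError("w_i and w_j must be distinct outcomes")
    return p[w_i] + p[w_j]

# "neither" is an assumed third outcome that completes the distribution.
weather = {
    "rain": Fraction(23, 100),
    "snow": Fraction(10, 100),
    "neither": Fraction(67, 100),
}

print(satisfies_axioms(weather))             # True
print(prob_either(weather, "rain", "snow"))  # 33/100
```

Treating rain and snow as atomic outcomes mirrors the text; if both could happen on the same day, they would be overlapping events and simple addition would overcount.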

It is sometimes the case that the axioms are defined in terms of **events**. An event is a generalization of an outcome and may contain several outcomes. Rolling a 1 on a six-sided die is an outcome. Rolling an odd number on a six-sided die is an event. Note that rolling a 1 on a six-sided die is *also* an event, which can lead to some confusion.

If we consider events, then the 3rd Axiom becomes:

3a. $P(E_i \cup E_j) = P(E_i) + P(E_j) - P(E_i \cap E_j)$

Why the difference? The first version of the axiom is defined for outcomes which must be mutually exclusive. The second version is defined for events which may not be mutually exclusive. Events can be collections of outcomes.

For example, $W$ might be the sides of a six-sided die, $W = \{1, 2, 3, 4, 5, 6\}$. These are the possible *outcomes*. In contrast, $E_1$ might be "all even valued sides of the die" and $E_2$ might be "all sides whose value is < 4", which are *events*. While the outcomes in $W$ are mutually exclusive, the events $E_1$ and $E_2$ are not (they have the value 2 in common). It follows that all outcomes are events but not vice versa.
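The die example can be worked through with Python sets. This sketch assumes a fair die (each side has probability 1/6), which the text does not state; `prob` is a hypothetical helper for this example.

```python
from fractions import Fraction

W = {1, 2, 3, 4, 5, 6}  # outcomes of a six-sided die

def prob(event):
    """P(event) under the assumption of a fair die: favorable sides / all sides."""
    return Fraction(len(event & W), len(W))

E1 = {w for w in W if w % 2 == 0}  # even sides: {2, 4, 6}
E2 = {w for w in W if w < 4}       # sides below 4: {1, 2, 3}

# Axiom 3a: subtract the overlap so the shared outcome 2 is not counted twice.
lhs = prob(E1 | E2)
rhs = prob(E1) + prob(E2) - prob(E1 & E2)

print(E1 & E2)   # {2}
print(lhs, rhs)  # 5/6 5/6
print(prob(E1) + prob(E2))  # 1, which overcounts the overlap
```

Adding the two event probabilities without the correction gives 1, even though rolling a 5 belongs to neither event.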

The **power set** of a set $X$ is the set of all subsets of $X$, including the empty set and $X$ itself. For example, {1, 2, 3} has the power set {{1, 2, 3}, {1, 2}, {1, 3}, {2, 3}, {1}, {2}, {3}, {}}. You can think of events as coming from the power set over the set of all individual outcomes, with only some of them being of any interest.
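One way to enumerate a power set is to collect the subsets of every size. The sketch below uses `itertools.combinations`; `power_set` is a name chosen for this example.

```python
from itertools import chain, combinations

def power_set(xs):
    """Return every subset of xs, from the empty set up to xs itself."""
    items = list(xs)
    subsets = chain.from_iterable(
        combinations(items, size) for size in range(len(items) + 1)
    )
    return [set(s) for s in subsets]

events = power_set([1, 2, 3])
print(len(events))  # 8, i.e. 2 ** 3
```

A set with $n$ elements has $2^n$ subsets, so a six-sided die already has 64 possible events, most of which we would never bother to name.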

If $W$ defines the set of outcomes, then the power set defines all possible events. Note that all outcomes are events but not all events are outcomes. So the power set of {All US States} will include {AK, HI}, {AR}, {ME, NM}, {CA, AZ, OR, WA, NV}. Some of these may be of interest to us and some may not, and some of these sets have names such as "Continental US" or "Western States". The upshot is that all outcomes are by definition mutually exclusive but events need not be. When working with events that may overlap, we must use the modified version of Axiom #3 (3a).

We will usually refer to events and need only worry about whether or not they're atomic or mutually exclusive. This meshes nicely with the language of software engineering in general and logging specifically.

## Notation

Probability notation can get a bit crazy. We've already mentioned that $P()$ can mean a lot of different things:

1. $P(A)$ - is the probability **distribution** over the outcomes/events of A. It is most likely a table of events and probability values.
2. $P(A=a_1)$ - is the **probability** that A takes on the value $a_1$. It's a single probability value.
3. $P(a_1)$ - when the context is not ambiguous, this is a shorthand for the above. It is a single probability value.

What if there's more than one variable?

1. $P(A, B)$ - is the probability distribution over the *Cartesian product* of outcomes/events in A and B. If A = {a1, a2, a3} and B = {b1, b2} then $P(A, B)$ returns a single probability value for each of (a1, b1), (a1, b2), (a2, b1), (a2, b2), (a3, b1), (a3, b2).
2. $P(A|B)$ - is actually *multiple* probability distributions. The $|B$ part is read as "given B". This is known as a *conditional probability* which we'll talk about later. There is one probability distribution for each possible value of B. So if B = {b1, b2, b3, b4}, $P(A|B)$ is actually *four* probability distributions.
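To make the notation concrete, here is one possible way to hold each object in Python. All numbers are made up for illustration, and the variable names (`P_A`, `P_AB`, `P_A_given_B`) are conventions invented for this sketch, with A = {a1, a2, a3} and B = {b1, b2}.

```python
from fractions import Fraction as F

# P(A): a distribution, i.e. a table with one value per outcome of A.
P_A = {"a1": F(3, 10), "a2": F(4, 10), "a3": F(3, 10)}

# P(A=a1): a single probability value looked up from the table.
p_a1 = P_A["a1"]

# P(A, B): one value per pair in the Cartesian product of A and B.
P_AB = {
    ("a1", "b1"): F(1, 10), ("a1", "b2"): F(2, 10),
    ("a2", "b1"): F(3, 10), ("a2", "b2"): F(1, 10),
    ("a3", "b1"): F(1, 10), ("a3", "b2"): F(2, 10),
}

# P(A|B): one full distribution over A for each value of B.
P_A_given_B = {
    "b1": {"a1": F(1, 5), "a2": F(3, 5), "a3": F(1, 5)},
    "b2": {"a1": F(2, 5), "a2": F(1, 5), "a3": F(2, 5)},
}

print(p_a1)                # 3/10
print(sum(P_AB.values()))  # 1: the joint table is a single distribution
print([sum(d.values()) for d in P_A_given_B.values()])  # each conditional sums to 1
```

Note the difference in shape: the joint table sums to 1 as a whole, while each row of the conditional table sums to 1 on its own.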

Operations on probability distributions are kind of like joins in database queries:

1. $P(A)P(A)$ - This is a cross join (Cartesian product) of A with itself. If A = {a1, a2}, it gives $P(a_1)$ * $P(a_1)$, $P(a_1)$ * $P(a_2)$, $P(a_2)$ * $P(a_1)$, $P(a_2)$ * $P(a_2)$.
2. $P(A)P(B)$ - This is a cross join of A with B.
3. $P(A|B)P(B)$ - This is an *inner* join. Remember that $P(A|B)$ is actually multiple probability distributions. This means that if B = {b1, b2} then we have $P(A|B=b1)P(B=b1)$, $P(A|B=b2)P(B=b2)$. In this case, B acts as the "foreign key" between the two sets.
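The join analogy can be sketched with dictionaries. The numbers below are hypothetical, and `cross` and `joined` are names made up for this example. The first product pairs every value of A with every value of B; the second only multiplies rows that agree on the value of B.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical distributions over A = {a1, a2} and B = {b1, b2}.
P_A = {"a1": F(1, 4), "a2": F(3, 4)}
P_B = {"b1": F(2, 5), "b2": F(3, 5)}

# P(A)P(B): every a is paired with every b, as in a Cartesian product.
cross = {(a, b): P_A[a] * P_B[b] for a, b in product(P_A, P_B)}

# P(A|B): one distribution over A per value of B.
P_A_given_B = {
    "b1": {"a1": F(1, 2), "a2": F(1, 2)},
    "b2": {"a1": F(1, 12), "a2": F(11, 12)},
}

# P(A|B)P(B): each conditional row is matched to P(B=b) on the shared
# value b, so b plays the role of the join key.
joined = {
    (a, b): p * P_B[b]
    for b, dist in P_A_given_B.items()
    for a, p in dist.items()
}

print(cross[("a1", "b1")])   # 1/4 * 2/5 = 1/10
print(joined[("a1", "b1")])  # 1/2 * 2/5 = 1/5
print(sum(joined.values()))  # 1
```

Both results have one entry per (a, b) pair, but they generally hold different values: the second uses a different distribution over A for each value of B.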

As you work through some examples later, you'll get a better feel for how it all hangs together.