In [None]:
'''
 * Copyright (c) 2016 Radhamadhab Dalai
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 * THE SOFTWARE.
'''

## Probability and Distributions

Probability, loosely speaking, concerns the study of uncertainty. Probability can be thought of as the fraction of times an event occurs, or as a degree of belief about an event. We then would like to use this probability to measure the chance of something occurring in an experiment. As mentioned in Chapter 1, we often quantify uncertainty in the data, uncertainty in the machine learning model, and uncertainty in the predictions produced by the model.

Quantifying uncertainty requires the idea of a **random variable**, which is a function that maps outcomes of random experiments to a set of properties that we are interested in. Associated with the random variable is a function that measures the probability that a particular outcome (or set of outcomes) will occur; this is called the **probability distribution**.

Probability distributions are used as a building block for other concepts, such as probabilistic modeling (Section 8.4), graphical models (Section 8.5), and model selection (Section 8.6).

In the next section, we present the three concepts that define a **probability space** (the sample space, the events, and the probability of an event) and how they are related to a fourth concept called the random variable. The presentation is deliberately slightly hand-wavy since a rigorous presentation may occlude the intuition behind the concepts. An outline of the concepts presented in this chapter is shown in Figure 6.1 (not provided here).

## Construction of a Probability Space

The theory of probability aims at defining a mathematical structure to describe random outcomes of experiments. For example, when tossing a single coin, we cannot determine the outcome, but by doing a large number of coin tosses, we can observe a regularity in the average outcome. Using this mathematical structure of probability, the goal is to perform automated reasoning, and in this sense, probability generalizes logical reasoning (Jaynes, 2003).

### Philosophical Issues

When constructing automated reasoning systems, classical Boolean logic does not allow us to express certain forms of plausible reasoning. Consider the following scenario:

Suppose we know that if A is true, then B is true. We observe that A is false. We find B becomes less plausible, although no conclusion can be drawn from classical logic. We observe that B is true. It seems A becomes more plausible. We use this form of reasoning daily.

We are waiting for a friend, and consider three possibilities:
* H1: she is on time;
* H2: she has been delayed by traffic; and
* H3: she has been abducted by aliens.

When we observe our friend is late, we must logically rule out H1. We also tend to consider H2 to be more likely, though we are not logically required to do so. Finally, we may consider H3 to be possible, but we continue to consider it quite unlikely. How do we conclude H2 is the most plausible answer?
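One way to make this intuition quantitative is to attach a plausibility to each hypothesis and update it with Bayes' theorem once we see that the friend is late. The sketch below is only illustrative: the prior values and the likelihoods of being late under each hypothesis are assumed numbers, not taken from the text.

```python
# Assumed prior plausibilities for the three hypotheses (hypothetical numbers).
prior = {"H1: on time": 0.70, "H2: traffic": 0.2999, "H3: aliens": 0.0001}

# Assumed probability of observing "the friend is late" under each hypothesis.
likelihood_late = {"H1: on time": 0.0, "H2: traffic": 0.9, "H3: aliens": 1.0}

# Bayes' theorem: posterior is proportional to prior times likelihood.
unnormalized = {h: prior[h] * likelihood_late[h] for h in prior}
evidence = sum(unnormalized.values())
posterior = {h: weight / evidence for h, weight in unnormalized.items()}

for h in prior:
    print(f"{h}: prior = {prior[h]:.4f}, posterior = {posterior[h]:.4f}")

most_plausible = max(posterior, key=posterior.get)
print("Most plausible after observing lateness:", most_plausible)
```

Under these assumptions H1 is ruled out exactly, H2 absorbs almost all of the plausibility, and H3 becomes slightly more plausible than before while remaining very unlikely.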

Seen in this way, probability theory can be considered a generalization of Boolean logic. In the context of machine learning, it is often applied in this way to formalize the design of automated reasoning systems. Further arguments about how probability theory is the foundation of reasoning systems can be found in Pearl (1988).

> "For plausible reasoning it is necessary to extend the discrete true and false values of truth to continuous plausibilities" (Jaynes, 2003).

The philosophical basis of probability, and how it should relate to what we think should be true (in the logical sense), was studied by Cox (Jaynes, 2003). Another way to think about it is that if we are precise about our common sense, we end up constructing probabilities.

E. T. Jaynes (1922–1998) identified three mathematical criteria, which must apply to all plausibilities:

1.  The degrees of plausibility are represented by real numbers.
2.  These numbers must be based on the rules of common sense.

3.  The resulting reasoning must be consistent, with the three following meanings of the word "consistent":
    (a) **Consistency or non-contradiction:** When the same result can be reached through different means, the same plausibility value must be found in all cases.
    (b) **Honesty:** All available data must be taken into account.
    (c) **Reproducibility:** If our state of knowledge about two problems is the same, then we must assign the same degree of plausibility to both of them.

*Figure 6.1 (image not rendered here): A mind map of the concepts related to random variables and probability distributions, as described in this chapter. Its nodes include mean, variance, Bayes' theorem (product rule, sum rule), random variable (example, transformations and distribution), independence, Gaussian, Bernoulli, beta, conjugate, sufficient statistics, exponential family, finite, property, and inner product, with links to regression, dimensionality reduction, and density estimation (Chapters 9 to 11).*

The Cox–Jaynes theorem proves these plausibilities to be sufficient to define the universal mathematical rules that apply to plausibility $p$, up to transformation by an arbitrary monotonic function. Crucially, these rules are the rules of probability.

**Remark.** In machine learning and statistics, there are two major interpretations of probability: the Bayesian and frequentist interpretations (Bishop, 2006; Efron and Hastie, 2016). The Bayesian interpretation uses probability to specify the degree of uncertainty that the user has about an event. It is sometimes referred to as “subjective probability” or “degree of belief”. The frequentist interpretation considers the relative frequencies of events of interest to the total number of events that occurred. The probability of an event is defined as the relative frequency of the event in the limit when one has infinite data. $\diamondsuit$
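The frequentist reading can be illustrated with a small simulation: the relative frequency of heads in repeated tosses tends to settle near the underlying probability as the number of tosses grows. This is a minimal sketch; the function name `relative_frequency`, the fair-coin default, and the fixed seed are assumptions made for the illustration.

```python
import random


def relative_frequency(num_tosses, p_heads=0.5, seed=0):
    """Fraction of simulated tosses that come up heads."""
    rng = random.Random(seed)  # fixed seed so that the run is repeatable
    heads = sum(rng.random() < p_heads for _ in range(num_tosses))
    return heads / num_tosses


for n in (10, 100, 10_000, 100_000):
    print(f"n = {n:>7}: relative frequency of heads = {relative_frequency(n):.4f}")
```

With only a handful of tosses the estimate can be far from 0.5; the frequentist definition refers to the limit of infinitely many tosses.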

Some machine learning texts on probabilistic models use lazy notation and jargon, which is confusing. This text is no exception. Multiple distinct concepts are all referred to as “probability distribution”, and the reader often has to disentangle the meaning from the context. One trick to help make sense of probability distributions is to check whether we are trying to model something categorical (a discrete random variable) or something continuous (a continuous random variable). The kinds of questions we tackle in machine learning are closely related to whether we are considering categorical or continuous models.

### 6.1.2 Probability and Random Variables

There are three distinct ideas that are often confused when discussing probabilities. First is the idea of a probability space, which allows us to quantify the idea of a probability. However, we mostly do not work directly with this basic probability space. Instead, we work with random variables (the second idea), which transfer the probability to a more convenient (often numerical) space. The third idea is the idea of a distribution or law associated with a random variable. We will introduce the first two ideas in this section and expand on the third idea in Section 6.2. Modern probability is based on a set of axioms proposed by Kolmogorov (Grinstead and Snell, 1997; Jaynes, 2003) that introduce the three concepts of sample space, event space, and probability measure. The probability space models a real-world process (referred to as an experiment) with random outcomes.

### The Sample Space $\Omega$

The **sample space** is the set of all possible outcomes of the experiment, usually denoted by $\Omega$. For example, two successive coin tosses have a sample space of $\{\text{hh, tt, ht, th}\}$, where “h” denotes “heads” and “t” denotes “tails”.
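As a small illustration, the sample space of successive coin tosses can be enumerated as a Cartesian product of the faces of a single coin. The helper name `sample_space` is hypothetical and only used for this sketch.

```python
from itertools import product


def sample_space(num_tosses, faces=("h", "t")):
    """All possible outcomes of `num_tosses` successive coin tosses."""
    return {"".join(outcome) for outcome in product(faces, repeat=num_tosses)}


omega = sample_space(2)
print(sorted(omega))  # ['hh', 'ht', 'th', 'tt']
```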

### The Event Space $\mathcal{A}$

The **event space** is the space of potential results of the experiment. A subset $A$ of the sample space $\Omega$ is in the event space $\mathcal{A}$ if at the end of the experiment we can observe whether a particular outcome $\omega \in \Omega$ is in $A$. The event space $\mathcal{A}$ is obtained by considering the collection of subsets of $\Omega$, and for discrete probability distributions (Section 6.2.1) $\mathcal{A}$ is often the power set of $\Omega$.
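For the two-toss example, taking the event space to be the power set of $\Omega$ can be sketched as follows. The helper `power_set` and the event name `at_least_one_head` are assumptions introduced for the illustration.

```python
from itertools import chain, combinations


def power_set(omega):
    """Every subset of omega, each represented as a frozenset."""
    items = sorted(omega)
    subsets = chain.from_iterable(
        combinations(items, size) for size in range(len(items) + 1)
    )
    return {frozenset(subset) for subset in subsets}


omega = {"hh", "ht", "th", "tt"}
event_space = power_set(omega)

# An event is a subset of omega, e.g. "at least one toss shows heads".
at_least_one_head = frozenset({"hh", "ht", "th"})

print(len(event_space))                  # 2**4 = 16 events
print(at_least_one_head in event_space)  # True
```

The empty set and $\Omega$ itself are both members of this event space.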

### The Probability $P$

With each event $A \in \mathcal{A}$, we associate a number $P(A)$ that measures the probability or degree of belief that the event will occur. $P(A)$ is called the **probability** of $A$. The probability of a single event must lie in the interval $[0, 1]$, and the total probability over all outcomes in the sample space $\Omega$ must be 1, i.e., $P(\Omega) = 1$.
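A minimal sketch of a probability measure on the two-toss sample space is given below. It assumes a fair coin, so that each of the four outcomes has probability $1/4$; this is an assumption of the example, not something stated in the text. The function name `prob` is likewise hypothetical.

```python
from fractions import Fraction

omega = ("hh", "ht", "th", "tt")

# Assumption: a fair coin, so every outcome is equally likely.
outcome_prob = {outcome: Fraction(1, 4) for outcome in omega}


def prob(event):
    """P(A): total probability of the outcomes contained in the event A."""
    return sum((outcome_prob[outcome] for outcome in event), Fraction(0))


first_is_heads = {"hh", "ht"}
both_tails = {"tt"}

print(prob(first_is_heads))               # 1/2
print(prob(set(omega)))                   # 1, i.e. P(Omega) = 1
print(prob(first_is_heads | both_tails))  # 3/4, the sum for disjoint events
```

Exact fractions are used so that the checks $P(\Omega) = 1$ and additivity over disjoint events hold without floating-point tolerance.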

Given a probability space $(\Omega, \mathcal{A}, P)$, we want to use it to model some real-world phenomenon. In machine learning, we often avoid explicitly referring to the probability space, but instead refer to probabilities on quantities of interest, which we denote by $\mathcal{T}$. In this book, we refer to $\mathcal{T}$ as the **target space** and refer to elements of $\mathcal{T}$ as states. We introduce a function $X : \Omega \to \mathcal{T}$ that takes an element of $\Omega$ (an outcome) and returns a particular quantity of interest $x$, a value in $\mathcal{T}$. This association/mapping from $\Omega$ to $\mathcal{T}$ is called a **random variable**. For example, in the case of tossing two coins and counting the number of heads, a random variable $X$ maps to the three possible outcomes: $X(\text{hh}) = 2$, $X(\text{ht}) = 1$, $X(\text{th}) = 1$, and $X(\text{tt}) = 0$. In this particular case, $\mathcal{T} = \{0, 1, 2\}$, and it is the probabilities on elements of $\mathcal{T}$ that we are interested in.

For a finite sample space $\Omega$ and finite $\mathcal{T}$, the function corresponding to a random variable is essentially a lookup table. For any subset $S \subseteq \mathcal{T}$, we associate $P_X(S) \in [0, 1]$ (the probability) to a particular event occurring corresponding to the random variable $X$.
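The lookup-table view can be sketched directly for the two-coin example: $P_X(S)$ is obtained by summing the probabilities of all outcomes that $X$ maps into $S$. The fair-coin probabilities and the function name `p_x` are assumptions made for this illustration.

```python
from fractions import Fraction

# The random variable X as a lookup table: outcome -> number of heads.
X = {"hh": 2, "ht": 1, "th": 1, "tt": 0}

# Assumption: a fair coin, so each outcome has probability 1/4.
P = {outcome: Fraction(1, 4) for outcome in X}


def p_x(S):
    """P_X(S): probability that X takes a value in the subset S of T."""
    return sum((P[outcome] for outcome, value in X.items() if value in S), Fraction(0))


print(p_x({1}))        # 1/2
print(p_x({1, 2}))     # 3/4
print(p_x({0, 1, 2}))  # 1
```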

**The name “random variable” is a great source of misunderstanding as it is neither random nor is it a variable. It is a function.**

**Remark.** The aforementioned sample space $\Omega$ unfortunately is referred to by different names in different books. Another common name for $\Omega$ is “state space” (Jacod and Protter, 2004), but state space is sometimes reserved for referring to states in a dynamical system (Hasselblatt and Katok, 2003). Other names sometimes used to describe $\Omega$ are “sample description space”, “possibility space”, and “event space”. $\diamondsuit$

We assume that the reader is already familiar with computing probabilities of intersections and unions of sets of events. A gentler introduction to probability with many examples can be found in chapter 2 of Walpole et al. (2011).

### Example 6.1

Consider a statistical experiment where we model a funfair game consisting of drawing two coins from a bag (with replacement). There are coins from the USA (denoted as \$) and the UK (denoted as £) in the bag, and since we draw two coins from the bag, there are four outcomes in total. The state space or sample space $\Omega$ of this experiment is then $\{(\$, \$), (\$, £), (£, \$), (£, £)\}$.

Let us assume that the composition of the bag of coins is such that a draw returns at random a \$ with probability 0.3. The event we are interested in is the total number of times the repeated draw returns \$. Let us define a random variable $X$ that maps the sample space $\Omega$ to $\mathcal{T}$, which denotes the number of times we draw \$ out of the bag. We can see from the preceding sample space we can get zero \$, one \$, or two \$s, and therefore $\mathcal{T} = \{0, 1, 2\}$.

The random variable $X$ (a function or lookup table) can be represented as a table like the following:

$$
\begin{aligned}
X((\$, \$)) &= 2 \quad \text{(6.1)} \\
X((\$, £)) &= 1 \quad \text{(6.2)} \\
X((£, \$)) &= 1 \quad \text{(6.3)} \\
X((£, £)) &= 0 \quad \text{(6.4)}
\end{aligned}
$$

Since we return the first coin we draw before drawing the second, this implies that the two draws are independent of each other, which we will discuss in Section 6.4.5. Note that there are two experimental outcomes, which map to the same event, where only one of the draws returns \$. Therefore, the probability mass function (Section 6.2.1) of $X$ is given by:

$$
\begin{aligned}
P(X = 2) &= P((\$, \$)) = P(\$) \cdot P(\$) = 0.3 \cdot 0.3 = 0.09 \quad \text{(6.5)} \\
P(X = 1) &= P((\$, £) \cup (£, \$)) = P((\$, £)) + P((£, \$)) \\
&= 0.3 \cdot (1 - 0.3) + (1 - 0.3) \cdot 0.3 = 0.42 \quad \text{(6.6)} \\
P(X = 0) &= P((£, £)) = P(£) \cdot P(£) = (1 - 0.3) \cdot (1 - 0.3) = 0.49 \quad \text{(6.7)}
\end{aligned}
$$
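As a cross-check, the count of \$ coins in two independent draws with success probability 0.3 follows a binomial distribution, so the values in (6.5) to (6.7) can be reproduced from the binomial formula. The binomial connection is an observation added here rather than something stated in the example, and `binomial_pmf` is a hypothetical helper name.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Two draws, each returning '$' with probability 0.3 (as in Example 6.1).
pmf = {k: binomial_pmf(k, n=2, p=0.3) for k in range(3)}

for k, value in pmf.items():
    print(f"P(X = {k}) = {value:.2f}")
```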

seen so far. This analysis of future performance relies on probability and statistics, most of which is beyond what will be presented in this chapter. The interested reader is encouraged to look at the books by Boucheron et al. (2013) and Shalev-Shwartz and Ben-David (2014). We will see more about statistics in Chapter 8.

## 6.2 Discrete and Continuous Probabilities

Let us focus our attention on ways to describe the probability of an event as introduced in Section 6.1. Depending on whether the target space is discrete or continuous, the natural way to refer to distributions is different.

When the target space $\mathcal{T}$ is discrete, we can specify the probability that a random variable $X$ takes a particular value $x \in \mathcal{T}$, denoted as $P(X = x)$. The expression $P(X = x)$ for a discrete random variable $X$ is known as the **probability mass function**.

When the target space $\mathcal{T}$ is continuous, e.g., the real line $\mathbb{R}$, it is more natural to specify the probability that a random variable $X$ is in an interval, denoted by $P(a \leq X \leq b)$ for $a < b$. By convention, we specify the probability that a random variable $X$ is less than a particular value $x$, denoted by $P(X \leq x)$. The expression $P(X \leq x)$ for a continuous random variable $X$ is known as the **cumulative distribution function**. We will discuss continuous random variables in Section 6.2.2. We will revisit the nomenclature and contrast discrete and continuous random variables in Section 6.2.3.
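For a continuous random variable, the probability of an interval can be read off the cumulative distribution function as $P(a \leq X \leq b) = P(X \leq b) - P(X \leq a)$. The sketch below assumes a standard normal random variable purely as an example; the choice of distribution and the helper names `normal_cdf` and `interval_prob` are not from the text.

```python
from math import erf, sqrt


def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for a Gaussian random variable (assumed example distribution)."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))


def interval_prob(a, b, cdf=normal_cdf):
    """P(a <= X <= b), computed as the difference of two CDF values."""
    return cdf(b) - cdf(a)


print(f"P(X <= 0)       = {normal_cdf(0.0):.4f}")
print(f"P(-1 <= X <= 1) = {interval_prob(-1.0, 1.0):.4f}")
```

For a continuous random variable a single point carries zero probability, which is why intervals and the cumulative distribution function are the natural objects to work with.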

**Remark.** We will use the phrase **univariate distribution** to refer to distributions of a single random variable (whose states are denoted by non-bold $x$). We will refer to distributions of more than one random variable as **multivariate distributions**, and will usually consider a vector of random variables (whose states are denoted by bold $\mathbf{x}$). $\diamondsuit$

### 6.2.1 Discrete Probabilities

When the target space is discrete, we can imagine the probability distribution of multiple random variables as filling out a (multidimensional) array of numbers. Figure 6.2 (not provided here) shows an example. The target space of the joint probability is the Cartesian product of the target spaces of each of the random variables. We define the **joint probability** as:

$$P(X = x_i, Y = y_j) = \frac{n_{ij}}{N} \quad \text{(6.9)}$$

where $n_{ij}$ is the number of events with state $x_i$ and $y_j$, and $N$ is the total number of events. The joint probability is the probability of the intersection of both events, that is, $P(X = x_i, Y = y_j) = P(X = x_i \cap Y = y_j)$. Figure 6.2 illustrates the probability mass function (pmf) of a discrete probability distribution. For two random variables $X$ and $Y$, the probability
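The count-based definition in (6.9) can be sketched by tallying pairs of observed states and dividing by the total number of events. The observations below are made-up data used only to illustrate the bookkeeping.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical observed (x, y) state pairs; not data from the text.
observations = [
    ("x1", "y1"), ("x1", "y1"), ("x1", "y2"), ("x2", "y1"),
    ("x2", "y2"), ("x2", "y2"), ("x2", "y2"), ("x1", "y1"),
]

counts = Counter(observations)  # n_ij for each observed pair (x_i, y_j)
N = len(observations)           # total number of events

# Joint probability P(X = x_i, Y = y_j) = n_ij / N
joint = {pair: Fraction(n_ij, N) for pair, n_ij in counts.items()}

for (x, y), p in sorted(joint.items()):
    print(f"P(X = {x}, Y = {y}) = {p}")
```

The resulting table of numbers is exactly the kind of two-dimensional array described above, with one entry per pair of states.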

In [1]:
# --- 1. Define the Sample Space (Ω) ---
# The set of all possible outcomes of the experiment.
# Each outcome is a tuple representing (coin1, coin2)
sample_space = [
    ('$', '$'),
    ('$', '£'),
    ('£', '$'),
    ('£', '£')
]

print("--- 1. Sample Space (Ω) ---")
print(f"Ω = {sample_space}")
print(f"Number of outcomes in Ω: {len(sample_space)}")
print("-" * 30)

# --- 2. Define the Probability Measure (P) ---
# We need to assign a probability to each elementary outcome in Ω.
# The text states a draw returns '$' with probability 0.3.
# Let P_dollar = 0.3, P_pound = 1 - 0.3 = 0.7
# Since draws are independent (as stated in the text):
P_dollar = 0.3
P_pound = 1 - P_dollar

probability_measure = {
    ('$', '$'): P_dollar * P_dollar,      # 0.3 * 0.3 = 0.09
    ('$', '£'): P_dollar * P_pound,      # 0.3 * 0.7 = 0.21
    ('£', '$'): P_pound * P_dollar,      # 0.7 * 0.3 = 0.21
    ('£', '£'): P_pound * P_pound       # 0.7 * 0.7 = 0.49
}

print("--- 2. Probability Measure (P) on Ω ---")
for outcome, prob in probability_measure.items():
    print(f"P({outcome}) = {prob:.2f}")

# Verify total probability sums to 1
total_prob_omega = sum(probability_measure.values())
print(f"Sum of probabilities P(Ω) = {total_prob_omega:.2f}")
if abs(total_prob_omega - 1.0) < 1e-9:
    print("Probabilities sum to 1.0 (consistent).")
else:
    print("Warning: Probabilities do not sum to 1.0.")
print("-" * 30)

# --- 3. Define a Random Variable (X) ---
# X: Ω → T, where T is the target space (number of '$' coins drawn).
# T = {0, 1, 2}
# The random variable X maps outcomes to the number of '$' coins.
# This can be represented as a dictionary (lookup table) or a function.

def random_variable_X(outcome):
    """
    Random variable X: counts the number of '$' in an outcome.
    e.g., X(('$', '$')) = 2
    """
    # Count the coins in the outcome tuple that are '$'
    return sum(1 for coin in outcome if coin == '$')

# Let's show the mapping for each outcome in Ω
print("--- 3. Random Variable X: Ω → T (Number of '$' coins) ---")
target_space = set() # To collect unique values in T
random_variable_mapping = {} # Store the mapping for clarity

for outcome in sample_space:
    value_in_T = random_variable_X(outcome)
    random_variable_mapping[outcome] = value_in_T
    target_space.add(value_in_T)

for outcome, value in random_variable_mapping.items():
    print(f"X({outcome}) = {value}")
print(f"Target Space (T) = {sorted(target_space)}")
print("-" * 30)

# --- 4. Calculate the Probability Mass Function (PMF) of X ---
# P_X(S) = P(X ∈ S) = P(X⁻¹(S)) = P({ω ∈ Ω : X(ω) ∈ S})
# We want to find P(X=0), P(X=1), P(X=2)

pmf_X = {}
for value_in_T in sorted(target_space):
    # Pre-image X⁻¹({value_in_T}): all outcomes in Ω that X maps to this value
    pre_image_outcomes = [
        outcome for outcome in sample_space
        if random_variable_X(outcome) == value_in_T
    ]

    # P(X = value_in_T) is the total probability of the pre-image outcomes
    pmf_X[value_in_T] = sum(
        probability_measure[outcome] for outcome in pre_image_outcomes
    )

print("--- 4. Probability Mass Function (PMF) of X (P_X) ---")
for value, prob in pmf_X.items():
    print(f"P(X = {value}) = {prob:.2f}")

# Verify the sum of PMF probabilities is 1
total_pmf_prob = sum(pmf_X.values())
print(f"Sum of PMF probabilities = {total_pmf_prob:.2f}")
if abs(total_pmf_prob - 1.0) < 1e-9:
    print("PMF probabilities sum to 1.0 (consistent).")
else:
    print("Warning: PMF probabilities do not sum to 1.0.")
print("-" * 30)

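# Values stated in Example 6.1, equations (6.5)-(6.7)
expected_pmf = {2: 0.09, 1: 0.42, 0: 0.49}

print("\n--- Example 6.1 Verification ---")
print("From the text:")
for value, prob in expected_pmf.items():
    print(f"P(X = {value}) = {prob:.2f}")

# Compare the computed PMF with the stated values instead of only asserting it in prose
assert all(abs(pmf_X[value] - prob) < 1e-9 for value, prob in expected_pmf.items())
print("\nOur calculated PMF matches the example values, demonstrating the concepts.")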

--- 1. Sample Space (Ω) ---
Ω = [('$', '$'), ('$', '£'), ('£', '$'), ('£', '£')]
Number of outcomes in Ω: 4
------------------------------
--- 2. Probability Measure (P) on Ω ---
P(('$', '$')) = 0.09
P(('$', '£')) = 0.21
P(('£', '$')) = 0.21
P(('£', '£')) = 0.49
Sum of probabilities P(Ω) = 1.00
Probabilities sum to 1.0 (consistent).
------------------------------
--- 3. Random Variable X: Ω → T (Number of '$' coins) ---
X(('$', '$')) = 2
X(('$', '£')) = 1
X(('£', '$')) = 1
X(('£', '£')) = 0
Target Space (T) = [0, 1, 2]
------------------------------
--- 4. Probability Mass Function (PMF) of X (P_X) ---
P(X = 0) = 0.49
P(X = 1) = 0.42
P(X = 2) = 0.09
Sum of PMF probabilities = 1.00
PMF probabilities sum to 1.0 (consistent).
------------------------------

--- Example 6.1 Verification ---
From the text:
P(X = 2) = 0.09
P(X = 1) = 0.42
P(X = 0) = 0.49

Our calculated PMF matches the example values, demonstrating the concepts.
