# Probability Foundations

https://machinelearningmastery.com/joint-marginal-and-conditional-probability-for-machine-learning/

## 1. Probability and Machine Learning

### There are three main sources of uncertainty in machine learning; they are:
    
* **Noise in observations**, e.g. measurement errors and random noise.

* **Incomplete coverage of the domain**, e.g. you can never observe all data.

* **Imperfect model of the problem**, e.g. all models have errors, but some are useful.

### Uncertainty in applied machine learning is managed using probability.

* Probability and statistics help us to understand and quantify the **expected value and variability of variables** in our observations from the domain.

* Probability helps to understand and quantify the **expected distribution and density of observations** in the domain.

* Probability helps to understand and quantify the **expected capability and variance in performance of our predictive models** when applied to new data.

## 2. Joint, Marginal, and Conditional Probability

**Joint Probability**: The probability of simultaneous events.

* P(A and B) = P(A given B) * P(B)

**Marginal Probability**: The probability of an event irrespective of the other variables.

* P(X=A) = sum of P(X=A, Y=yi) over all values yi of Y

**Conditional Probability**: The probability of an event given the presence of another event.

* P(A given B) = P(A | B) = P(A and B) / P(B) 
    
    assuming P(B) ≠ 0
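
The three definitions above can be checked on a small table. The following is a minimal sketch, assuming a made-up joint distribution over two variables (the weather/activity values and the helper names `marginal_x`, `marginal_y`, and `conditional_x_given_y` are illustrative, not from the source article):

```python
from fractions import Fraction

# Hypothetical joint distribution P(X, Y): X is the weather, Y is the activity.
joint = {
    ("sunny", "walk"): Fraction(3, 10),
    ("sunny", "read"): Fraction(1, 10),
    ("rainy", "walk"): Fraction(1, 10),
    ("rainy", "read"): Fraction(5, 10),
}


def marginal_x(joint, x):
    """Marginal: P(X=x) = sum of P(X=x, Y=yi) over all yi."""
    return sum(p for (xi, _), p in joint.items() if xi == x)


def marginal_y(joint, y):
    """Marginal: P(Y=y) = sum of P(X=xi, Y=y) over all xi."""
    return sum(p for (_, yi), p in joint.items() if yi == y)


def conditional_x_given_y(joint, x, y):
    """Conditional: P(X=x | Y=y) = P(X=x, Y=y) / P(Y=y), only if P(Y=y) != 0."""
    p_y = marginal_y(joint, y)
    if p_y == 0:
        raise ValueError("P(Y=y) is zero, so the conditional is undefined")
    return joint.get((x, y), 0) / p_y


p_sunny = marginal_x(joint, "sunny")                                # 2/5
p_walk = marginal_y(joint, "walk")                                  # 2/5
p_sunny_given_walk = conditional_x_given_y(joint, "sunny", "walk")  # 3/4

# Joint recovered from the conditional: P(A and B) = P(A given B) * P(B)
p_sunny_and_walk = p_sunny_given_walk * p_walk                      # 3/10
```

Exact fractions are used here only so the identities hold without floating-point rounding.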

These types of probability form the basis of much of predictive modeling with problems such as **classification and regression**. For example:

* The probability of a row of data is the **joint probability** across all input variables.

* The probability of a specific value of one input variable is the **marginal probability** across the values of the other input variables.

* The predictive model itself is an estimate of the **conditional probability** of an output given an input example.
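
To make the link to predictive modeling concrete, here is a rough sketch that estimates these quantities by counting rows in a tiny labelled dataset. The rows, the single `outlook` input, and the function names (`joint`, `marginal_input`, `predict_proba`) are all hypothetical; a real model would generalize rather than count.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical labelled rows: (outlook, label).
rows = [
    ("sunny", "play"), ("sunny", "play"), ("sunny", "stay"),
    ("rainy", "stay"), ("rainy", "stay"), ("rainy", "play"),
    ("sunny", "play"), ("rainy", "stay"),
]

n = len(rows)
joint_counts = Counter(rows)                     # counts of (input, label) pairs
input_counts = Counter(x for x, _ in rows)       # counts of each input value


def joint(x, y):
    """Empirical joint probability of seeing the row (x, y)."""
    return Fraction(joint_counts[(x, y)], n)


def marginal_input(x):
    """Empirical marginal probability of the input value x."""
    return Fraction(input_counts[x], n)


def predict_proba(y, x):
    """Empirical conditional P(label = y | outlook = x).

    Raises ZeroDivisionError if x never appears in the data.
    """
    return Fraction(joint_counts[(x, y)], input_counts[x])


p_row = joint("sunny", "play")                    # 3/8
p_sunny = marginal_input("sunny")                 # 1/2
p_play_given_sunny = predict_proba("play", "sunny")  # 3/4
```

The last value plays the role of the model output: an estimate of the conditional probability of the label given the input.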


Some additional notes:

* The calculation of the **joint probability** is sometimes called the fundamental rule of probability or the **“product rule” of probability** or the **“chain rule”** of probability. For example:

    **P(A1 and A2 and A3 and A4) = P(A4 | A3 and A2 and A1) * P(A3 | A2 and A1) * P(A2 | A1) * P(A1)**
    

* The **joint probability is symmetrical**, meaning that **P(A and B) is the same as P(B and A)**. The calculation using the conditional probability is also symmetrical, for example:

    **P(A and B) = P(A given B) * P(B) = P(B given A) * P(A)**
    

* The **marginal probability** is another important foundational rule in probability, referred to as the **“sum rule”**.
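
The chain rule and the symmetry of the joint probability can be illustrated with a standard 52-card deck. This is a small sketch under that assumption; the events (drawing aces without replacement, and a single card being a king or a heart) are chosen for illustration only.

```python
from fractions import Fraction
from math import comb

# Chain rule with Ai = "the i-th card drawn (without replacement) is an ace":
# P(A1 and A2 and A3) = P(A3 | A2 and A1) * P(A2 | A1) * P(A1)
p_a1 = Fraction(4, 52)
p_a2_given_a1 = Fraction(3, 51)
p_a3_given_a1_a2 = Fraction(2, 50)
p_three_aces = p_a3_given_a1_a2 * p_a2_given_a1 * p_a1

# Cross-check by counting: 3-card hands made only of aces / all 3-card hands.
p_by_counting = Fraction(comb(4, 3), comb(52, 3))

# Symmetry with a single draw: A = "card is a king", B = "card is a heart".
p_king, p_heart = Fraction(4, 52), Fraction(13, 52)
p_king_given_heart = Fraction(1, 13)   # one king among the 13 hearts
p_heart_given_king = Fraction(1, 4)    # one heart among the 4 kings

via_b = p_king_given_heart * p_heart   # P(A given B) * P(B)
via_a = p_heart_given_king * p_king    # P(B given A) * P(A)
```

Both `via_a` and `via_b` evaluate to 1/52, the probability of drawing the king of hearts, and `p_three_aces` matches the counting result 1/5525.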

## 3. Probability of Independence and Exclusivity

### Independence

If one variable is not dependent on a second variable, this is called **independence or statistical independence**.

**Joint Probability for independent events**

* P(A and B) = P(A) * P(B)

**Marginal Probability for an independent event**

* P(A)

**Conditional Probability for independent events**

* P(A given B) = P(A)

    (the occurrence of B has no effect on the probability of A)
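
As an illustrative check of these rules, the sketch below enumerates the 36 outcomes of two fair dice and tests whether pairs of events are independent. The events and helper names (`prob`, `conditional`, `independent`) are assumptions made for this example.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event given as a predicate over outcomes."""
    favourable = sum(1 for o in outcomes if event(o))
    return Fraction(favourable, len(outcomes))


def conditional(a, b):
    """P(A given B) = P(A and B) / P(B); assumes P(B) is non-zero."""
    return prob(lambda o: a(o) and b(o)) / prob(b)


def independent(a, b):
    """True when P(A and B) = P(A) * P(B)."""
    return prob(lambda o: a(o) and b(o)) == prob(a) * prob(b)


def first_even(o):
    return o[0] % 2 == 0


def sum_is_7(o):
    return sum(o) == 7


def sum_is_8(o):
    return sum(o) == 8


indep_7 = independent(first_even, sum_is_7)   # True
indep_8 = independent(first_even, sum_is_8)   # False
p_even_given_7 = conditional(first_even, sum_is_7)   # 1/2, equal to P(first_even)
p_even_given_8 = conditional(first_even, sum_is_8)   # 3/5, not equal to P(first_even)
```

Knowing the sum is 7 leaves the probability of an even first die at 1/2, whereas knowing the sum is 8 shifts it to 3/5, so only the first pair is independent.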

We may be familiar with the notion of **statistical independence** from **sampling**, where it is assumed that one sample is unaffected by prior samples and does not affect future samples.

Many machine learning algorithms assume that samples from a domain are independent of each other and come from the same probability distribution, referred to as **independent and identically distributed, or i.i.d. for short**.

### Exclusivity

If the occurrence of one event excludes the occurrence of other events, then the events are said to be **mutually exclusive**.

**Joint Probability for mutually exclusive events**

* P(A and B) = 0.0

**Probability of mutually exclusive events**

* P(A or B) = P(A ∪ B) = P(A) + P(B)

**Probability of non-mutually exclusive events**

* P(A or B) = P(A) + P(B) - P(A and B)
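
Both "or" rules can be verified on a single fair die. The following is a minimal sketch in which events are represented as sets of faces; the specific events and the `prob` helper are illustrative assumptions.

```python
from fractions import Fraction

# Sample space of one fair six-sided die; events are sets of faces.
sample_space = frozenset(range(1, 7))


def prob(event):
    """Probability of an event as (favourable faces) / (all faces)."""
    return Fraction(len(event & sample_space), len(sample_space))


# Mutually exclusive events: the roll cannot be both 1 and 2.
is_one, is_two = {1}, {2}
p_both = prob(is_one & is_two)                     # 0
p_either = prob(is_one | is_two)                   # 1/3
p_either_by_sum = prob(is_one) + prob(is_two)      # 1/6 + 1/6 = 1/3

# Non-mutually exclusive events: 6 is both even and above four.
even, above_four = {2, 4, 6}, {5, 6}
p_union = prob(even | above_four)                  # 2/3
p_union_by_rule = prob(even) + prob(above_four) - prob(even & above_four)
```

Subtracting P(A and B) removes the double-counted face 6; without it the sum would be 5/6 rather than 2/3.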