# Chapter 3: Mathematical foundations of statistics

[introduction]

## 3.1 Sample Spaces

To have a better understanding of probabilities, we use a **sample space**.

**Definition 3.1:**

1. The sample space $\Omega$ is the set of all possible outcomes of an experiment.
2. Elements in the sample space are called **outcomes**, and are written in short as $\omega \in \Omega$.
3. Subsets of the sample space are called **events**, and are written in short as $A \subset \Omega$.

Let's start with a simple example: you and your friend just finished the exam for this course. There are two outcomes for each of you, passing $P$ and failing $F$. Then, the set of all possible outcomes is:
$$ \Omega = \{PP, PF, FP, FF\} $$
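
As a minimal sketch, the sample space of the exam example can be enumerated in Python. The variable names (`results`, `omega`, `at_least_one_pass`) and the extra event "at least one of you passes" are illustrative choices, not part of the course material.

```python
from itertools import product

# Each student either passes ("P") or fails ("F").
results = ["P", "F"]

# The sample space contains every ordered pair (your result, your friend's result).
omega = {"".join(pair) for pair in product(results, repeat=2)}
print(sorted(omega))  # ['FF', 'FP', 'PF', 'PP']

# An event is a subset of the sample space, e.g. "at least one of you passes".
at_least_one_pass = {w for w in omega if "P" in w}
print(sorted(at_least_one_pass))  # ['FP', 'PF', 'PP']
```

Representing events as Python sets keeps the code close to the definitions: an outcome is an element, an event is a subset.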

Now, you are going to a café to celebrate the end of the exam. Consider two possible events: having something to drink ($A$) and having some cake ($B$).

To visualize events, we like to use Venn diagrams. The full rectangle represents the sample space, and the circles represent the events. The shaded area shows the event under consideration.

**Definition 3.2:**  
For a given event $A$, let $A^c = \{ \omega \in \Omega : \omega \notin A \}$. This is called the **complement** of $A$, and is the event 'not $A$'.

The first possible event is that you do not buy a drink: $A^c = \{ \omega \in \Omega : \omega \notin A \}$. In this case, you could buy a cake, or nothing.  

![ss_notA](figures/ss_notA.png)





It is also possible that you buy neither a drink nor a cake. This is written as $\{ \omega \in \Omega : \omega \notin A \text{ and } \omega \notin B \}$. This is the event in which the **complement** of $A$ **and** the complement of $B$ both occur.

![ss_notAnorB](figures/ss_notAnorB.png)

**Definition 3.3:**  
The **union** of the events $A$ and $B$ is the event "$A$ or $B$" and is defined as:
$$
A \cup B = \{ \omega \in \Omega : \omega \in A \text{ or } \omega \in B \text{ or both} \}
$$
Here "or" is non-exclusive, meaning that $A$ and $B$ can happen separately or simultaneously.

In our example, this would mean that you are having a drink, some cake, or both.
![ss_AorB](figures/ss_AorB.png)


**Definition 3.4:**
The **intersection** of $A$ and $B$ is the event "$A$ and $B$" and is defined as:
$$
A \cap B = \{ \omega \in \Omega : \omega \in A \text{ and } \omega \in B \text{ simultaneously} \}
$$

In our example, this would mean that you are having both a drink and some cake.
![ss_AandB](figures/ss_AandB.png)

**Definition 3.5:**
The **difference** of $A$ and $B$ is the event "$A$ but not $B$". This is defined as:
$$
A \backslash B = \{ \omega \in \Omega : \omega \in A \text{ and } \omega \notin B \}
$$

![ss_AbutnotB](figures/ss_AbutnotB.png)

As a separate example, suppose the café has three cakes: chocolate, apple and lemon.
For this example only, let $A$ be the event that you have some cake and $B$ the event that you have chocolate cake.
The event that you have some cake that is not chocolate is: $A \backslash B = \{ \text{ chocolate, apple, lemon } \} \backslash \{ \text{ chocolate }\} = \{ \text{ apple, lemon } \}$
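
The set operations of Definitions 3.2 to 3.5 map directly onto Python's built-in `set` type. The sketch below assumes a hypothetical four-outcome sample space for the café visit (drink or no drink, cake or no cake); this encoding is an illustration and does not appear in the text above.

```python
# Hypothetical sample space for the café visit: each outcome records
# (did you buy a drink?, did you buy cake?).
omega = {
    ("drink", "cake"),
    ("drink", "no cake"),
    ("no drink", "cake"),
    ("no drink", "no cake"),
}

A = {w for w in omega if w[0] == "drink"}  # event A: you have a drink
B = {w for w in omega if w[1] == "cake"}   # event B: you have some cake

complement_A = omega - A      # "not A"
neither = omega - (A | B)     # "not A and not B"
union = A | B                 # "A or B" (non-exclusive)
intersection = A & B          # "A and B"
difference = A - B            # "A but not B"

print(sorted(neither))        # [('no drink', 'no cake')]
print(sorted(intersection))   # [('drink', 'cake')]
print(sorted(difference))     # [('drink', 'no cake')]

# The cake example: some cake that is not chocolate.
cakes = {"chocolate", "apple", "lemon"}
not_chocolate = cakes - {"chocolate"}
print(sorted(not_chocolate))  # ['apple', 'lemon']
```

Note that `neither` can equally be computed as `(omega - A) & (omega - B)`, which is the set form of "not $A$ and not $B$".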

## 3.2 Derivation of Bayes theorem


**Definition 3.6:**
    
**Bayes' theorem** is a mathematical tool for computing a conditional probability: the probability of an event given that another event has occurred. To find the formula for Bayes' theorem we combine two equations. 
The first equation gives the probability of two events both happening:
\begin{equation}
P(A \cap B) = P(B)P(A | B)
\label{eq:probability_one} \tag{1}
\end{equation}
Equation \eqref{eq:probability_one} says that the probability of both $A$ and $B$ happening equals the probability of $B$ happening multiplied 
by the probability of $A$ happening when we already know that $B$ is happening.

Similarly, the probability of both $A$ and $B$ happening can also be calculated with equation \eqref{eq:probability_two}:

\begin{equation}
P(A \cap B) = P(A)P(B | A)
\label{eq:probability_two} \tag{2}
\end{equation}

Combining these equations gives:

\begin{equation}
P(B)P(A | B) = P(A)P(B | A)
\tag{3}
\end{equation}
Dividing both sides by $P(B)$, assuming $P(B) > 0$, gives

\begin{equation}
P(A | B) = \frac{P(A)P(B | A)}{P(B)}
\label{eq:bayes_theorem} \tag{4}
\end{equation}

Equation \eqref{eq:bayes_theorem} is called Bayes' theorem. To explain this further we will use the following example. Imagine you want to know the probability that a person named Bob is an astronomy student.

Here $B$ is the information you already have, namely that the person is named Bob, and $A$ is the event that the person is an astronomy student. The prior is $P(A)$, the probability that a person is an astronomy student before you know their name.
To find $P(A | B)$, you thus need to know $P(B)$, the probability of a person being named Bob. 
Then you also need $P(A)$, the probability of a person being an astronomy student. Finally, you need $P(B | A)$, the probability of a person being named Bob when you know they are an astronomy student.
You can, for example, estimate this by looking up the names of people who study astronomy and counting how many of them are named Bob.

So now, when you meet someone named Bob, you can compute the probability of them being an astronomy student.
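
To make the Bob example concrete, here is a minimal sketch of equation (4) in Python. All three input probabilities are made-up numbers chosen purely for illustration, and the function name `bayes` is hypothetical.

```python
from fractions import Fraction


def bayes(prior_a, likelihood_b_given_a, evidence_b):
    """Return P(A | B) = P(A) * P(B | A) / P(B)."""
    if evidence_b == 0:
        raise ValueError("P(B) must be positive")
    return prior_a * likelihood_b_given_a / evidence_b


# Made-up numbers, for illustration only.
p_astro = Fraction(1, 1000)          # P(A): a person is an astronomy student
p_bob_given_astro = Fraction(1, 50)  # P(B | A): an astronomy student is named Bob
p_bob = Fraction(1, 200)             # P(B): a person is named Bob

p_astro_given_bob = bayes(p_astro, p_bob_given_astro, p_bob)
print(p_astro_given_bob)  # 1/250
```

With these invented numbers, learning that the person is named Bob raises the probability of them being an astronomy student from 1/1000 to 1/250, because Bobs are assumed to be over-represented among astronomy students.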

Reading this example, you might not think Bayes' theorem is that groundbreaking. However, one of its many important applications
is medical diagnosis. With Bayes' theorem you can calculate the probability of a
patient having a certain disease based on the symptoms the patient has, which shows how essential such an equation can be.


## 3.3 Law of total probability 
The law of total probability expresses the probability of an outcome as the sum of the probabilities of all the distinct, mutually exclusive events that lead to this outcome. To make this clearer, let's throw two indistinguishable dice and calculate the probability that the sum of the two dice is four. There are two distinct events that have this outcome, namely $2 + 2$ and $3 + 1$. The probabilities of these two events are $\frac{1}{36}$ and $\frac{1}{18}$ respectively: $2 + 2$ can occur in only one way, whereas $3 + 1$ can occur in two ways, since either die can show the three. The law of total probability then tells us that the total probability of having four as the outcome is $\frac{1}{36} + \frac{1}{18} = \frac{1}{12}$.
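
The dice calculation can be checked by brute-force enumeration. This sketch treats the dice as ordered pairs internally (36 equally likely rolls) and groups them into the unordered events $2 + 2$ and $3 + 1$; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely ordered rolls of two dice.
rolls = list(product(range(1, 7), repeat=2))
p_each = Fraction(1, len(rolls))

# The two distinct (unordered) events that give a sum of four.
p_two_two = sum(p_each for r in rolls if sorted(r) == [2, 2])
p_three_one = sum(p_each for r in rolls if sorted(r) == [1, 3])

# Direct count of every roll whose sum is four.
p_sum_four = sum(p_each for r in rolls if sum(r) == 4)

print(p_two_two, p_three_one, p_sum_four)  # 1/36 1/18 1/12
```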

### 3.3.1 Discrete case
Let $A = \{A_1, A_2, \ldots, A_n\}$ be a set of collectively exhaustive and mutually disjoint events, meaning that one and only one of these events will occur. Then the total probability of an event $B$ happening is
\begin{equation}
P(B) = \sum_{i=1}^{n} P(B|A_i)P(A_i)
\end{equation}
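
As a sketch of the discrete formula, the dice example can be recast with a partition: let $A_i$ be the event that the first die shows $i$, and let $B$ be the event that the sum is four. This choice of partition and the helper name `total_probability` are assumptions made for illustration.

```python
from fractions import Fraction


def total_probability(p_b_given_a, p_a):
    """Return P(B) = sum_i P(B | A_i) * P(A_i) for a partition A_1..A_n."""
    if len(p_b_given_a) != len(p_a):
        raise ValueError("need one conditional probability per event A_i")
    if sum(p_a) != 1:
        raise ValueError("the events A_i must be collectively exhaustive")
    return sum(pb * pa for pb, pa in zip(p_b_given_a, p_a))


# P(A_i): the first die shows i, for i = 1..6.
p_a = [Fraction(1, 6)] * 6

# P(B | A_i): the second die must show 4 - i, which is only possible for i <= 3.
p_b_given_a = [Fraction(1, 6) if i <= 3 else Fraction(0) for i in range(1, 7)]

print(total_probability(p_b_given_a, p_a))  # 1/12
```

Exact fractions are used so that the check `sum(p_a) != 1` is not affected by floating-point rounding.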


### 3.3.2 Continuous case
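
A sketch of the standard continuous analogue, under the assumption that the disjoint events $A_i$ are replaced by the values of a continuous variable $x$ with probability density $p(x)$ (notation introduced here for illustration): the sum over events becomes an integral over $x$.

\begin{equation}
P(B) = \int P(B|x)\, p(x)\, \mathrm{d}x
\end{equation}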


## 3.4 The difference between frequentist and Bayesian views

When a frequentist talks about probability, it is all about frequency in the long run. The data set that is collected and analyzed is regarded as one of many hypothetical data sets that address the same question, and the uncertainty is only due to sampling error. Bayesian probability, in contrast, is considered to be a degree of belief. From the Bayesian perspective, a probability is assigned to a hypothesis. Under frequentist inference, the hypothesis is normally tested without being given a probability. 

To give an example of how a frequentist and a Bayesian look at probability, we will look at the problem of flipping many coins and the probability of a coin landing heads. From the frequentist perspective, when you flip a fair coin many times, about half of the flips come up heads, so the probability of heads is 50 percent. From a Bayesian point of view, you are uncertain about the outcome and believe heads and tails to be equally likely, which also gives 50 percent. Both approaches work here because the problem is about flipping many coins. However, if we reduce the problem to flipping only one coin, the frequentist interpretation is hard to apply, because a single occurrence has no long-run frequency. 

Another example comes from the medical field, when you want to diagnose a patient. In the frequentist approach, the doctor would look at the patient's current complaints and compare them with the records of earlier patients who had similar complaints, together with their diagnoses, in order to arrive at a diagnosis for the current patient. In the Bayesian approach, the doctor would also take the patient's own previous medical records into account. This means that prior knowledge is also considered in the diagnosis. 

An important difference is that Bayesian inference takes prior knowledge into account and treats the parameter as a random variable. This means that a probability distribution is assigned to the parameter. With the Bayesian mindset, you use probability to measure how plausible you believe an event or hypothesis to be.
With frequentist inference, the parameter is not a random variable but is assumed to be fixed, so no probability is assigned to it. From the frequentist mindset, you treat probability as a frequency: the fraction of times a certain event happens when the experiment is repeated infinitely many times. 

Another difference is how they do hypothesis testing. When frequentists test a hypothesis, they focus on the probability of their observed data under the null hypothesis. When Bayesians test a hypothesis, they consider the probabilities of competing hypotheses given the data they observed, which directly expresses the researcher's belief about the hypotheses.



Two competing philosophies of statistics have sparked debates. These two approaches, frequentist and Bayesian, differ in how they treat probabilities and statistical inference.

A frequentist assigns probabilities to data rather than hypotheses, focusing on how often events occur in the long run. In this framework, confidence intervals are designed to include the true parameter a certain percentage of the time when experiments are repeated many times. As the number of repetitions increases, false outliers become less frequent. Statistical methods are built to perform well under these repeatable conditions. Importantly, the true parameters of the probability model are seen as fixed values. This means we cannot make probabilistic statements about them; any statement about a parameter is simply true or false, that is, it has probability 100% or 0%.
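
The long-run meaning of a confidence interval can be illustrated with a small simulation. This is a sketch under stated assumptions: the data are drawn from a normal distribution with a known standard deviation, the interval is the usual 95% interval for the mean, and the function name `coverage` and all parameter values are illustrative.

```python
import random
import statistics


def coverage(true_mean=0.0, sigma=1.0, n=30, trials=2000, seed=0):
    """Fraction of repeated experiments whose 95% interval contains true_mean."""
    rng = random.Random(seed)
    z = 1.96  # approximate 97.5th percentile of the standard normal
    half_width = z * sigma / n ** 0.5
    hits = 0
    for _ in range(trials):
        # One hypothetical repetition of the experiment.
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        mean = statistics.fmean(sample)
        if mean - half_width <= true_mean <= mean + half_width:
            hits += 1
    return hits / trials


print(coverage())  # should be close to 0.95
```

The parameter `true_mean` stays fixed throughout; only the interval changes from one repetition to the next, which is exactly the frequentist reading of "95%".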

Bayesians assign probabilities to hypotheses, viewing probabilities as degrees of belief. This perspective allows for probabilistic statements about unknown parameters even before any data is observed, with probabilities ranging from 0% to 100%. The "prior probability" reflects beliefs about a parameter before data is collected, and Bayesian statistics is designed to update these beliefs in response to new evidence. By incorporating prior knowledge, Bayesian models refine the probabilities of hypotheses as more data becomes available, offering a dynamic and flexible approach to inference.
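
A minimal sketch of Bayesian updating, using the coin example: the belief about the probability of heads is described by a Beta distribution, which is updated by simply adding the observed counts. The Beta prior, the flip counts and the function names are assumptions chosen for illustration.

```python
from fractions import Fraction


def update_beta(alpha, beta, heads, tails):
    """Posterior Beta parameters after observing the given coin flips."""
    return alpha + heads, beta + tails


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return Fraction(alpha, alpha + beta)


# Prior: Beta(1, 1), i.e. every value of P(heads) is considered equally plausible.
alpha, beta = 1, 1
print(beta_mean(alpha, beta))  # 1/2

# First batch of (made-up) data: 7 heads, 3 tails.
alpha, beta = update_beta(alpha, beta, heads=7, tails=3)
print(beta_mean(alpha, beta))  # 2/3

# Second batch: 3 heads, 7 tails. The earlier posterior acts as the new prior.
alpha, beta = update_beta(alpha, beta, heads=3, tails=7)
print(beta_mean(alpha, beta))  # 1/2
```

Each posterior serves as the prior for the next batch of data, which is the sense in which beliefs are refined as more data become available.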

In many cases, Bayesian methods closely resemble other statistical approaches, especially when working with large samples from a fixed model. For smaller sample sizes, many conventional methods can be understood as approximations to Bayesian inferences based on specific prior distributions. Recognizing the implicit priors in these methods can provide valuable insights into their underlying assumptions. However, some methods, like hypothesis testing, may yield results that differ substantially from those obtained using Bayesian approaches, highlighting philosophical and practical distinctions.

The debate between frequentism and Bayesianism often revolves around the use of priors, which frequentists criticize for introducing potential biases into analysis. Despite this critique, Bayesian methods are widely regarded as reasonable if the chosen prior is sensible. The key requirement is that the prior assigns a positive probability to all parameter values, ensuring that posterior estimates remain consistent and asymptotically normal as sample sizes increase. Even when a prior is biased, its influence diminishes at a rate of $1/n$, making it negligible compared to the $1/\sqrt{n}$ contribution of variance in the long run.
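
The two rates can be seen in a simple conjugate example. The sketch below assumes a coin whose observed frequency of heads is exactly one half and a deliberately biased Beta(8, 2) prior (prior mean 0.8); both choices, and the helper name `posterior_summary`, are illustrative.

```python
import math


def posterior_summary(alpha, beta, heads, n):
    """Mean and standard deviation of the Beta posterior after n flips."""
    a, b = alpha + heads, beta + (n - heads)
    mean = a / (a + b)
    sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, sd


rows = []
for n in (10, 100, 1000, 10000):
    heads = n // 2  # data with an observed frequency of exactly 0.5
    mean, sd = posterior_summary(8, 2, heads, n)
    shift = mean - heads / n  # pull of the biased prior, here 3 / (10 + n)
    rows.append((n, shift, sd, shift / sd))
    print(f"n={n:6d}  shift={shift:.5f}  sd={sd:.5f}  shift/sd={shift / sd:.3f}")
```

In this example the shift caused by the prior falls off like $1/n$, while the posterior standard deviation falls off like $1/\sqrt{n}$, so their ratio shrinks as more data are collected.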

[example of marginalisation in Bayesian statistics in astronomy, showing that this is better than the frequentist approach]

Since this course is BASTA, we will continue working on statistics from the Bayesian view. 

