<h1 align="center">Probability and Statistics</h1>

<h3 align="center">Probability versus Statistics</h3>

- Probability and statistics are two closely related mathematical subjects.
- The main difference between probability and statistics:
 - **Probability** is a numerical description of how likely an event is to occur;
 - **Statistics** is the discipline that concerns the collection, organization, analysis, interpretation and presentation of data.

<center><img src="images/RM_PvsS.jpg" width="700" height="300" alt="Example" /></center>
  

<h3 align="center">Sigma-algebra</h3>

$\textbf{Definition}$. Let $X$ be some set, and let $\mathcal{P}(X)$ represent its power set, i.e. $\mathcal{P}(X) \equiv 2^X = \{A \mid A \subseteq X\}$.
<br>
A set $\Sigma \subseteq \mathcal{P}(X)$ is called a $\sigma$-algebra on the set $X$ if it satisfies the following three properties:
<br> &emsp; $\bullet$ $X \in \Sigma$, i.e. $X$ is considered to be the **universal set**;
<br> &emsp; $\bullet$ $\Sigma$ is **closed under complementation**, i.e. if $A \in \Sigma$ then $X \setminus A \in \Sigma$;
<br> &emsp; $\bullet$ $\Sigma$ is **closed under countable unions**, i.e. if $A_i \in \Sigma$ for $i = \overline{1, \infty}$ then $A = \left ( \bigcup_{i=1}^{\infty}A_i \right ) \in \Sigma$.

From these properties we can easily conclude that:
<br> &emsp; $\bullet$ $\emptyset \in \Sigma$;
<br> &emsp; $\bullet$ For every set $X$ the **smallest** $\sigma$-algebra will be $\{X, \emptyset\}$ and the **largest** will be $\mathcal{P}(X)$;
<br> &emsp; $\bullet$ $\Sigma$ is **closed under countable intersections**, i.e. if $A_i \in \Sigma$ for $i = \overline{1, \infty}$ then $A = \left ( \bigcap_{i=1}^{\infty}A_i \right ) \in \Sigma$.
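
The three defining properties can be checked mechanically when $X$ is finite. The sketch below is only an illustration: the helper names `power_set` and `is_sigma_algebra` are made up for this example, and it relies on the fact that on a finite set closure under countable unions reduces to closure under pairwise unions.

```python
from itertools import chain, combinations


def power_set(universe):
    """Return all subsets of a finite set as a set of frozensets."""
    items = list(universe)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return {frozenset(s) for s in subsets}


def is_sigma_algebra(universe, family):
    """Check the three defining properties on a finite universe.

    On a finite set every countable union is a finite union, so
    closure under pairwise unions is sufficient.
    """
    universe = frozenset(universe)
    family = {frozenset(a) for a in family}
    # Property 1: the universal set belongs to the family.
    if universe not in family:
        return False
    # Property 2: closed under complementation.
    if any(universe - a not in family for a in family):
        return False
    # Property 3: closed under unions.
    return all(a | b in family for a in family for b in family)


X = {1, 2, 3}
trivial = [set(), X]                   # the smallest sigma-algebra
generated = [set(), {1}, {2, 3}, X]    # generated by the set {1}
not_closed = [set(), {1}, {2}, X]      # complement of {1} is missing

print(is_sigma_algebra(X, trivial))       # True
print(is_sigma_algebra(X, power_set(X)))  # True
print(is_sigma_algebra(X, generated))     # True
print(is_sigma_algebra(X, not_closed))    # False
```

The smallest and the largest $\sigma$-algebras both pass the check, while a family that is missing a complement fails it.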

$\textbf{Definition}$. The pair  $(X, \Sigma)$ is called a **measurable space**, or **Borel space**, and the elements of $\Sigma$ are called **measurable sets**.


$\textbf{Definition}$. A function $f:(X, \Sigma_X) \to (Y, \Sigma_Y)$ between two measurable spaces is called a **measurable function** if for all $A \in \Sigma_Y$ the set $f^{-1}(A) \in \Sigma_X$.

<h3 align="center">Measure Space</h3>

$\textbf{Definition}$. Let $(X, \Sigma)$ be a measurable space.
<br> A function $\mu: \Sigma \to [0, \infty]$ is called a **measure** if it satisfies the following properties:
<br> &emsp; $\bullet$ **Non-negativity**, i.e. for all $A \in \Sigma$, we have $\mu(A) \geq 0$;
<br> &emsp; $\bullet$ **Null empty set**, i.e. $\mu(\emptyset) = 0$;
<br> &emsp; $\bullet$ **Countable additivity**, i.e. for pairwise disjoint sets $A_i \in \Sigma$, $i = \overline{1, \infty}$, the following holds:

$$\mu \left( \bigcup_{i=1}^\infty A_i\right)=\sum_{i=1}^\infty \mu(A_i).$$

$\textbf{Definition}$. A triple $(X, \Sigma, \mu)$ is called a **measure space**, where:
<br> &emsp; $\bullet$ $X$ is a set;
<br> &emsp; $\bullet$ $\Sigma$ is a $\sigma$-algebra on the set $X$;
<br> &emsp; $\bullet$ $\mu$ is a measure on $(X, \Sigma)$.

<h3 align="center">Probability</h3>

$\textbf{Definition}$. A measure space $(\Omega , \Sigma, P)$ is called a **probability space** if $P(\Omega ) = 1$.

The elements of $\Sigma$ are called **events**.

For each pair of events $A \in \Sigma$ and $B \in \Sigma$, the following properties of the probability space are valid:
<br> &emsp; $\bullet$ $P(\emptyset) = 0$;
<br> &emsp; $\bullet$ $0 \le P(A) \le 1$;
<br> &emsp; $\bullet$ $P(\Omega  \setminus A) = 1 - P(A)$;
<br> &emsp; $\bullet$ if $A \subset B$ then $P(A) \le P(B)$;
<br> &emsp; $\bullet$ $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.

<h3 align="center">Example 1</h3>

The experiment consists of tossing a fair coin; the outcome is either **H**eads or **T**ails.

Thus, our probability space $(\Omega, \Sigma, P)$ is:
<br> &emsp; $\bullet$ $\Omega = \{H, T\}$;
<br> &emsp; $\bullet$ The $\sigma$-algebra contains 4 events: $\Sigma = 2^\Omega = \left \{ \emptyset, \{H\}, \{T\}, \{H, T\} \right \}$;
<br> &emsp; $\bullet$ The probability measure $P$ in this example is:
$$P(\emptyset) = 0, \text{ } P(\{H\}) = 0.5, \text{ } P(\{T \}) = 0.5, \text{ } P(\{ H, T \}) = 1.$$


<h3 align="center">Example 2</h3>

A number between $0$ and $1$ is chosen at random, uniformly.
<br>
The open intervals of the form $(a,b)$, where $0 < a < b < 1$, could be taken as the generator sets.
<br>
Each such interval can be assigned the probability $P((a, b)) = b - a$, which generates the **Lebesgue measure** on $[0,1]$.

Thus, our probability space $(\Omega, \Sigma, P)$ is:
<br> &emsp; $\bullet$ $\Omega = [0, 1]$;
<br> &emsp; $\bullet$ $\Sigma$ is the $\sigma$-algebra of **Borel sets** on $\Omega$;
<br> &emsp; $\bullet$ The probability measure $P$ in this example is the Lebesgue measure on $[0,1]$.

<h3 align="center">Exercises 11.1</h3>

The experiment consists of tossing a fair coin three times; the outcome of each toss is either **H**eads or **T**ails.

Define the probability space $(\Omega, \Sigma, P)$.

<h3 align="center">Independent events</h3>

Let $(\Omega, \Sigma, P)$ be a probability space.

$\textbf{Definition}$. Two events $A\in \Sigma$ and $B\in \Sigma$ are **independent** if:

$$P(A \cap B) = P(A)P(B)$$

$\textbf{Definition}$. A finite set of events $\{A_i\}_{i=1}^{n}$ is **pairwise independent** if for any $i,j \in \overline{1, n}$ with $i \neq j$:

$$P(A_i \cap A_j) = P(A_i)P(A_j)$$

$\textbf{Definition}$. A finite set of events $\{A_i\}_{i=1}^{n}$ is **mutually independent** if for every subset of indices $I \subseteq \{1, \dots, n\}$ with at least two elements:

$$P\left(\bigcap_{i \in I}A_i\right) = \prod_{i \in I}P(A_i)$$

$\textbf{Note}$. Any collection of **mutually independent events is pairwise independent**, but some **pairwise independent collections are not mutually independent**.

<h3 align="center">Example</h3>

Let's consider the example described in **Exercise 11.1**.

Let:
- $A$ be the event **Toss 1** and **Toss 2** give the same result;
- $B$ be the event **Toss 2** and **Toss 3** give the same result;
- $C$ be the event **Toss 3** and **Toss 1** give the same result.

We have:
$$P(A) = P(B) = P(C) = \frac{1}{2} \text{ and } P(A \cap B) =  P(B \cap C) = P(C \cap A)  = \frac{1}{4}.$$

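However, $A$, $B$, and $C$ are not mutually independent: $A \cap B \cap C$ is the event that all three tosses give the same result, so $P(A \cap B \cap C) = \frac{2}{8} = \frac{1}{4}$, whereas $P(A)P(B)P(C) = \frac{1}{8}$.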
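
The probabilities in this example can be confirmed by enumerating all eight equally likely outcomes. This is a minimal sketch; the function names (`prob`, `event_a`, and so on) are chosen for the illustration only.

```python
from fractions import Fraction
from itertools import product

# Sample space: all sequences of three tosses, each equally likely.
omega = list(product("HT", repeat=3))


def prob(event):
    """P(event), where the event is a predicate on outcomes."""
    favourable = sum(1 for w in omega if event(w))
    return Fraction(favourable, len(omega))


def event_a(w):
    return w[0] == w[1]  # Toss 1 and Toss 2 agree


def event_b(w):
    return w[1] == w[2]  # Toss 2 and Toss 3 agree


def event_c(w):
    return w[2] == w[0]  # Toss 3 and Toss 1 agree


p_a, p_b, p_c = prob(event_a), prob(event_b), prob(event_c)
p_ab = prob(lambda w: event_a(w) and event_b(w))
p_bc = prob(lambda w: event_b(w) and event_c(w))
p_ca = prob(lambda w: event_c(w) and event_a(w))
p_abc = prob(lambda w: event_a(w) and event_b(w) and event_c(w))

print(p_a, p_b, p_c)           # 1/2 1/2 1/2
print(p_ab, p_bc, p_ca)        # 1/4 1/4 1/4  (pairwise independent)
print(p_abc, p_a * p_b * p_c)  # 1/4 1/8      (not mutually independent)
```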

<h3 align="center">Exercises 12.1</h3>

The experiment consists of rolling a fair die, and $A = \{2, 4, 6\}$ and $B = \{1, 2, 3, 4\}$ are two events.

Define the probability space $(\Omega, \Sigma, P)$ and prove that $A$ and $B$ are independent events.


<h3 align="center">Conditional probability</h3>

Let $(\Omega, \Sigma, P)$ be a probability space and let $A\in \Sigma$ and $B\in \Sigma$ be two events.

$\textbf{Definition}$. The **conditional probability** of $A$ given $B$ is defined as the quotient of the probability of the joint occurrence of the events $A$ and $B$, and the probability of $B$.

In other words, if $P(B) \gt 0$, then the **conditional probability** of $A$ given $B$ is given as:

$$P(A|B) = \frac{P(A\cap B)}{P(B)}.$$

$\textbf{Statement}$. Let $P(B) \gt 0$. Then $A$ and $B$ are independent events if and only if $P(A|B) = P(A)$.

$\textbf{Proof}$. If $A$ and $B$ are independent events, then $P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A)P(B)}{P(B)} = P(A).$
<br>
Conversely, if $P(A|B) \equiv \frac{P(A \cap B)}{P(B)} = P(A)$, then $P(A \cap B) = P(A)P(B)$ and $A$ and $B$ are independent.
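
As an illustration of the definition and of the statement above, consider a hypothetical experiment that is not part of the exercises: rolling two fair dice. The sketch below computes $P(A|B)$ by counting outcomes; the helper names are assumptions made for this example.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered pairs of results of two fair dice.
omega = list(product(range(1, 7), repeat=2))


def prob(event):
    """P(event), where the event is a predicate on outcomes."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))


def cond_prob(a, b):
    """P(A|B) = P(A and B) / P(B); only defined when P(B) > 0."""
    p_b = prob(b)
    if p_b == 0:
        raise ValueError("conditioning event has probability zero")
    return prob(lambda w: a(w) and b(w)) / p_b


def first_is_six(w):
    return w[0] == 6


def sum_is_seven(w):
    return sum(w) == 7


def sum_is_eight(w):
    return sum(w) == 8


# Independent pair: conditioning does not change the probability.
print(prob(sum_is_seven), cond_prob(sum_is_seven, first_is_six))  # 1/6 1/6
# Dependent pair: conditioning changes the probability.
print(prob(sum_is_eight), cond_prob(sum_is_eight, first_is_six))  # 5/36 1/6
```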

<h3 align="center">Total Probability Theorem</h3>

Let $(\Omega, \Sigma, P)$ be a probability space.

$\textbf{Theorem} \space \textbf{17}$. If $B_1, B_2, B_3, \cdots$ is a countable (or finite) partition of $\Omega$ into events with $P(B_i) \gt 0$, then for any event $A \in \Sigma$:

$$P(A) = \sum_{i}P(A|B_i)P(B_i).$$

$\textbf{Proof}$. Since $B_1, B_2, B_3, \cdots $ is a partition of the $\Omega$, we can write:
$$\Omega = \bigcup_{i}B_i \text{ and } A = A \cap \Omega = A \cap \left ({ \bigcup_i B_i }\right ) = \bigcup_i \left ( A \cap B_i \right ).$$

Now note that the sets $A \cap B_i$ are disjoint since the $B_i$'s are disjoint. Thus:

$$P(A) = P\left  (\bigcup_i ( A \cap B_i )\right ) = \sum_{i}P( A \cap B_i ) = \sum_{i} P(A | B_i) P(B_i).$$
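
A small numeric instance of the theorem, using made-up numbers: suppose three machines produce $1/2$, $3/10$ and $1/5$ of all items (the partition $B_1, B_2, B_3$) and their defect rates are $1\%$, $2\%$ and $3\%$ (the values $P(A|B_i)$). The function name `total_probability` is an assumption of this sketch.

```python
from fractions import Fraction


def total_probability(priors, likelihoods):
    """P(A) = sum_i P(A|B_i) P(B_i) for a partition B_1, B_2, ...

    priors      -- the probabilities P(B_i); they must sum to 1
    likelihoods -- the conditional probabilities P(A|B_i)
    """
    if len(priors) != len(likelihoods):
        raise ValueError("priors and likelihoods must have equal length")
    if abs(sum(priors) - 1) > 1e-12:
        raise ValueError("the probabilities of a partition must sum to 1")
    return sum(l * p for l, p in zip(likelihoods, priors))


# Hypothetical data: share of production and defect rate per machine.
priors = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]
likelihoods = [Fraction(1, 100), Fraction(2, 100), Fraction(3, 100)]

p_defect = total_probability(priors, likelihoods)
print(p_defect)  # 17/1000
```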

<h3 align="center">Bayes' Theorem</h3>

Let $(\Omega, \Sigma, P)$ be a probability space and let $A\in \Sigma$ and $B\in \Sigma$ be two events with $P(A) \gt 0$ and $P(B) \gt 0$.

$\textbf{Theorem} \space \textbf{18}$. **Bayes’ theorem** is stated mathematically as the following equation:
$$P(A|B) = \frac{P(B|A)P(A)}{P(B)},$$
where:
<br> &emsp; $\bullet$ $P(A|B)$ is the conditional probability of $A$ occurring given that $B$ is true;
<br> &emsp; $\bullet$ $P(B|A)$ is the conditional probability of $B$ occurring given that $A$ is true;
<br> &emsp; $\bullet$ $P(A)$ and $P(B)$ are the probabilities of observing $A$ and $B$ respectively.


$\textbf{Proof}$. We can rewrite the definitions of $P(A|B)$ and $P(B|A)$ in the following forms:

$$P(A|B)P(B) = P(A \cap B) \text{ and } P(B|A)P(A) = P(B \cap A).$$

Equating the two, we get $P(A|B)P(B) = P(A \cap B) = P(B \cap A) = P(B|A)P(A)$, and thus: 
$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}.$$

<h3 align="center">Example</h3>

**Chris Wiggins**, an associate professor of applied mathematics at **Columbia University**, posed the following problem in an article in **Scientific American** (<a href = 'https://www.scientificamerican.com/article/what-is-bayess-theorem-an/'>Link to the article in Scientific American</a>):

$\textbf{Problem}$. A patient goes to see a doctor. The doctor performs a test with **99%** reliability - that is, **99%** of people who are sick test positive and **99%** of the healthy people test negative. The doctor knows that only **1%** of the people in the country are sick. Now the question is: **if the patient tests positive, what are the chances the patient is sick?**

The intuitive answer is **99%**, but the correct answer is **50%**.

$\textbf{Solution}$. Wiggins's explanation can be summarized with the help of the following table which illustrates the scenario in a hypothetical population of $10{,}000$ people:

|        | Diseased | Not Diseased | Total |
|:------:|:--------:|:------------:|:-----:|
| test + | 99       | 99           | 198   |
| test - | 1        | 9801         | 9802  |
| Total  | 100      | 9900         | 10000 |

We want to know the probability of disease $(A)$ given that the patient has a positive test $(B)$, i.e. $P(A|B).$

$\bullet$ We know that the unconditional probability of disease is $1\%$, i.e. $P(A) = 0.01$;
<br> $\bullet$ The unconditional probability of a positive test is $P(B) = 198/10000 = 0.0198$;
<br> $\bullet$ We also know the sensitivity of the test is $99\%$, i.e. $P(B | A) = 0.99$.

Using Bayes' theorem we get:

$$P(A|B)= \frac{P(B|A)P(A)}{P(B)} = \frac{0.99 \cdot 0.01}{0.0198} = \frac{1}{2} = 50\%.$$
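
The same computation can be written as a small function. This is a sketch under the assumptions of the problem; the name `posterior_sick` and its parameter names (`prevalence`, `sensitivity`, `specificity`) are introduced here for illustration. The denominator $P(B)$ is obtained from the total probability theorem rather than read off the table.

```python
from fractions import Fraction


def posterior_sick(prevalence, sensitivity, specificity):
    """P(sick | positive test) by Bayes' theorem.

    prevalence  -- P(sick)
    sensitivity -- P(positive | sick)
    specificity -- P(negative | healthy)
    """
    p_pos_given_sick = sensitivity
    p_pos_given_healthy = 1 - specificity
    # P(positive) by the total probability theorem.
    p_pos = (p_pos_given_sick * prevalence
             + p_pos_given_healthy * (1 - prevalence))
    return p_pos_given_sick * prevalence / p_pos


p = posterior_sick(
    prevalence=Fraction(1, 100),
    sensitivity=Fraction(99, 100),
    specificity=Fraction(99, 100),
)
print(p)  # 1/2
```

Exact fractions are used so that the result is not affected by floating-point rounding; ordinary floats work as well.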

 

<h3 align="center">Exercises 13.1</h3>

Consider the same problem, but assuming that **50%** (instead of **1%**) of the people in the country are sick. 
<br>
What will be the answer to the same question in this case?

<h3 align="center">Extended Bayes' Theorem</h3>

Let $(\Omega, \Sigma, P)$ be a probability space.

$\textbf{Theorem} \space \textbf{19}$. If $B_1, B_2, B_3, \dots$ is a countable (or finite) partition of $\Omega$ such that 
$P(B_i) > 0$ for each $i \in \{1, 2, 3, \dots\}$, 
then for any event $A \in \Sigma$, such that $P(A) > 0$, and for each $k \in \{1, 2, 3, \dots \}$:

$$P(B_k|A) = \frac{P(A|B_k)P(B_k)}{\sum_{i}P(A|B_i)P(B_i)}.$$


$\textbf{Proof}$. Let's consider any event $B \in \Sigma$, such that $P(B)>0$. Applying the total probability theorem to the denominator in the definition of conditional probability, we get:

$$P(B|A) = \frac{P(B \cap A)}{P(A)} = \frac{P(B \cap A)}{\sum_{i}P(A|B_i)P(B_i)}.$$

Now we use the definition of conditional probability:

$$P(B \cap A) = P(A \cap B) = P(A|B) \cdot P(B).$$

Substituting this in the expression for $P(B|A)$ we immediately obtain the result:

$$P(B|A) = \frac{P(A|B) \cdot P(B)}{\sum_{i}P(A|B_i)P(B_i)}.$$

This is true for any event $B$ and so, replacing $B$ by $B_k$, we get:

$$P(B_k|A) = \frac{P(A|B_k)P(B_k)}{\sum_{i}P(A|B_i)P(B_i)}.$$
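
The formula can be evaluated for all $k$ at once. The sketch below uses made-up numbers: three machines produce $1/2$, $3/10$ and $1/5$ of all items ($P(B_i)$) with defect rates $1\%$, $2\%$ and $3\%$ ($P(A|B_i)$), and we ask which machine a defective item most likely came from. The function name `posteriors` is an assumption of this example.

```python
from fractions import Fraction


def posteriors(priors, likelihoods):
    """Return the list [P(B_k | A)] for every k.

    priors      -- P(B_i) for a partition B_1, B_2, ...
    likelihoods -- P(A | B_i)
    """
    joint = [l * p for l, p in zip(likelihoods, priors)]  # P(A and B_i)
    evidence = sum(joint)  # P(A), by the total probability theorem
    if evidence == 0:
        raise ValueError("P(A) must be positive")
    return [j / evidence for j in joint]


# Hypothetical data: share of production and defect rate per machine.
priors = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]
likelihoods = [Fraction(1, 100), Fraction(2, 100), Fraction(3, 100)]

post = posteriors(priors, likelihoods)
print(post)       # [Fraction(5, 17), Fraction(6, 17), Fraction(6, 17)]
print(sum(post))  # 1
```

The posterior probabilities always sum to $1$, because the $B_k$ form a partition.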

<h3 align="center">Naive Bayes classifier</h3>

- **What is a Classifier?**
<br> A classifier is a machine learning model that is used to discriminate different objects based on certain features.

- **Principle of Naive Bayes Classifier:**
<br> A Naive Bayes classifier is a probabilistic machine learning model that’s used for classification tasks. The classifier is built on Bayes' theorem.

- **What does Naive Bayes mean?**
<br> Naive Bayes classifiers assume strong, or **naive**, independence between the features, given the class.

- **Where is Naive Bayes Classifier used?**
<br> Popular uses of Naive Bayes Classifiers include **spam filters**, **text analysis** and **medical diagnosis**. 

- In **Workshop №3** we will solve the **problem of playing golf** using the Naive Bayes Classifier.

<h3 align="center">Random variable</h3>

Let $(\Omega, \Sigma, P)$ be a probability space.

$\textbf{Definition}$. A function $X: \Omega \to \mathbb{R}$ is called a **random variable** if for any Borel set $\mathcal{B} \subseteq \mathbb{R}$, the set $X^{-1}(\mathcal{B})$ is an event in $\Sigma$. The probability that $X$ takes a value in a set $\mathcal{B} \subseteq \mathbb{R}$ is written as:

$$P(X \in \mathcal{B}) = P(X^{-1}(\mathcal{B}))= P(\{\omega \in \Omega | X(\omega) \in \mathcal{B} \}).$$

$\textbf{Definition}$. Two random variables $X$ and $Y$ are **independent** if for all Borel sets $\mathcal{A}, \mathcal{B} \subseteq \mathbb{R}$:

$$P(X \in \mathcal{A}, Y \in \mathcal{B}) = P(X \in \mathcal{A})\cdot P(Y \in \mathcal{B}).$$


$\textbf{Definition}$. The **Cumulative Distribution Function** (**CDF**) $F_X:\mathbb{R} \to [0, 1]$ of a real-valued **random variable** $X$ is the function given by:

$$ F_X(x) = P(X \le x).$$

$\textbf{Theorem} \space \textbf{20}$. A function $F:\mathbb{R} \to [0, 1]$ is a cumulative distribution function if and only if:
- $F$ is **non-decreasing**, i.e. for each $x_1 \lt x_2$ we have $F(x_1) \le F(x_2)$;
- $F$ is **normalized**, i.e. $$\lim_{x \to -\infty}F(x) = 0 \text{ and } \lim_{x \to +\infty}F(x) = 1;$$
- $F$ is **right-continuous**.

$\textbf{Proof}$. 
The proof of this theorem is left to the students!

<h3 align="center">Discrete random variable</h3>

$\textbf{Definition}$. A random variable $X$ is **discrete** if it takes countably (or finitely) many values $\{x_1, x_2, \cdots \}$.

We can define the **probability mass function**: $f_X(x)=P(X=x)$. This function has the following properties:

&emsp; $\bullet$ $f_X(x) \geq 0$  for each $x \in \mathbb{R}$;
 
&emsp; $\bullet$ $\sum_{x}f_X(x) = 1.$

Thus, the cumulative distribution function can be defined as:

$$F_X(x) = P(X \le x) = \sum_{x_i \le x}f_X(x_i)$$
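
The relation between the probability mass function and the CDF can be sketched for a concrete discrete variable. The example below assumes $X$ is the result of rolling a fair die; the dictionary `pmf` and the function `cdf` are names chosen for this illustration.

```python
from fractions import Fraction

# PMF of a fair die: f_X(x) = 1/6 for x = 1, ..., 6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}


def cdf(pmf, x):
    """F_X(x) = sum of f_X(x_i) over all values x_i <= x."""
    return sum(p for value, p in pmf.items() if value <= x)


print(sum(pmf.values()))  # 1    (the PMF sums to one)
print(cdf(pmf, 0))        # 0    (no values at or below 0)
print(cdf(pmf, 3.5))      # 1/2  (the CDF is a step function)
print(cdf(pmf, 6))        # 1
```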


<h3 align="center">Examples: Bernoulli distribution</h3>

- $\operatorname{Bernoulli}(p)$: we have an experiment with two outcomes, success ($X = 1$) with probability $p$ and failure ($X = 0$) with probability $q=(1 - p)$:

$$\operatorname{P}(X = k) = p^kq^{1 - k} \text{ for } k \in \{0, 1\}.$$

- **Example**: Any experiment that has exactly two outcomes.

<center><img src="images/RM_Bernoulli.svg" width="1500" height="300" alt="Example" /></center>


<h3 align="center">Examples: Binomial distribution</h3>

- $\operatorname{Binomial}(n, p)$: we have $n$ independent experiments, each with two outcomes with probabilities $p$ and $q=(1-p)$, and $X$ counts the number of successes (here $C_n^k = \binom{n}{k}$):

$$
\operatorname{P}(X = k) = 
\left\{\begin{matrix}
C_n^k p^kq^{n-k}, & \text{ for } k = 0, 1, \dots, n \\
0 & \text{ otherwise}
\end{matrix}\right.
;$$

- **Example**: Any problem where you repeat $n$ independent trials, each with exactly two outcomes, and count the number of successes.

<center><img src="images/RM_Binomial.svg" width="800" height="300" alt="Example" /></center>
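
The binomial probability mass function is easy to evaluate directly. This is a minimal sketch; the name `binomial_pmf` is an assumption of the example, and `math.comb` plays the role of $C_n^k$.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p); zero outside k = 0, ..., n."""
    if not 0 <= k <= n:
        return 0.0
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Probability of exactly 2 heads in 4 tosses of a fair coin.
print(binomial_pmf(2, 4, 0.5))  # 0.375

# The probabilities over k = 0, ..., n sum to one (up to rounding).
total = sum(binomial_pmf(k, 10, 0.3) for k in range(11))
print(abs(total - 1) < 1e-12)  # True
```

For $n = 1$ the same function gives the Bernoulli distribution.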


<h3 align="center">Examples: Geometric distribution</h3>

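- $\operatorname{Geometric}(p)$: we repeat independent experiments with two outcomes, each succeeding with probability $p$, until the first success; $X$ is the number of the trial on which it occurs:
$$\operatorname{P}(X=k) = p(1-p)^{k-1} \text{ for } k = 1, 2, \dots$$
- **Example**: What is the probability that the first success occurs on the $k$-th Bernoulli trial, i.e. after $k-1$ failures?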
<center><img src="images/RM_Geometric.svg" width="1000" height="300" alt="Example" /></center>


<h3 align="center">Example: Hypergeometric distribution</h3>

- $\operatorname{Hypergeometric}(N,K,n)$:
<br>
$$\operatorname{P}(X = k) = \frac{\binom{K}{k} \binom{N - K}{n-k}}{\binom{N}{n}}.$$
- **Example**: Suppose a deck of cards contains 20 cards: 6 red cards and 14 black cards. 5 cards are drawn randomly without replacement. What is the probability that exactly 4 red cards are drawn?
<center><img src="images/RM_Geometric.svg" width="800" height="300" alt="Example" /></center>
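
The card example can be computed directly from the formula with $N = 20$, $K = 6$, $n = 5$ and $k = 4$. The sketch below uses exact fractions; the function name `hypergeometric_pmf` is chosen for this illustration.

```python
from fractions import Fraction
from math import comb


def hypergeometric_pmf(k, N, K, n):
    """P(X = k): k marked items in n draws without replacement
    from a population of N items, K of which are marked."""
    # Outside the feasible range the probability is zero.
    if k < max(0, n - (N - K)) or k > min(n, K):
        return Fraction(0)
    return Fraction(comb(K, k) * comb(N - K, n - k), comb(N, n))


# 20 cards, 6 red, 5 drawn: probability of exactly 4 red cards.
p = hypergeometric_pmf(4, N=20, K=6, n=5)
print(p, float(p))  # 35/2584, approximately 0.0135
```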


<h3 align="center">Example: Poisson distribution</h3>

- $\operatorname{Poisson}(\lambda)$, where $\lambda \gt 0$:
<br>
$$\operatorname{P}(X = k) = \frac{e^{-\lambda} \lambda^{k}}{k!} \text{ for } k = 0, 1, 2, \dots$$
- **Example**: The Poisson distribution is applied in situations where there is a large number $n$ of independent Bernoulli trials with a very small probability of success $p$; in that case $\lambda = np$.
<center><img src="images/RM_Poisson.svg" width="800" height="300" alt="Example" /></center>
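
The connection with many rare Bernoulli trials can be checked numerically by comparing $\operatorname{Binomial}(n, p)$ with $\operatorname{Poisson}(np)$. The values $n = 1000$ and $p = 0.003$ below are arbitrary illustrative choices, and the function names are assumptions of this sketch.

```python
from math import comb, exp, factorial


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 1000, 0.003  # many trials, small success probability
lam = n * p         # matching Poisson parameter

# Largest pointwise difference between the two PMFs for k = 0, ..., 20.
max_gap = max(
    abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(21)
)
print(max_gap < 0.01)  # True
```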


<h3 align="center">Continuous random variable</h3>

$\textbf{Definition}$. A random variable $X$ is **continuous** if there exists a non-negative function $f:\mathbb{R} \to \mathbb{R}$, <br> i.e.  $f(x) \geq 0$ for all $x \in \mathbb{R}$, with $\int_{-\infty}^{\infty}f(x)dx = 1$, such that for all $a \lt b$:

$$P(a \lt X \lt b) = \int_{a}^{b}f(x)dx.$$

The function $f(x)$ is called the **Probability Density Function** (**PDF**) and we have that: 

$$F_X(x)=\int_{-\infty}^{x}f(t)dt,$$

and
$$f(x) = F_X'(x) \text{ for all points } x \in \mathbb{R} \text{ where } F_X \text{ is differentiable}.$$


<h3 align="center">Examples: Uniform distributions</h3>

&emsp;$\bullet$ $\operatorname{Uniform}$ distribution: 
<br> &emsp; &emsp; $\bullet$ Discrete: $f(x) = 
\left\{\begin{matrix}
\frac{1}{n}, & \text{ for } x = 1, 2, \dots, n \\
0 & \text{ otherwise}
\end{matrix}\right.
;$
<br> &emsp;&emsp; $\bullet$ Continuous: 
 $ f(x) = 
\left\{\begin{matrix}
\frac{1}{b-a}, & \text{ for } x \in [a, b] \\
0 & \text{ otherwise}
\end{matrix}\right.
.$

<center><img src="images/RM_Uniform.svg" width="1200" height="300" alt="Example" /></center>

<h3 align="center">Examples: Normal distribution</h3>

- $\operatorname{Gaussian}(\mu, \sigma^2)$, where $\mu \in \mathbb{R}$ and $\sigma \gt 0$:
$f(x) = \frac{1}{\sigma \sqrt{2\pi}}\exp{\left(-\frac{1}{2\sigma^2}(x - \mu)^2\right)}
.$

    We say that $X$ has the standard normal distribution if $\mu=0$ and $\sigma=1$.

<center><img src="images/RM_Normal.svg" width="1500" height="300" alt="Example" /></center>
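
For the normal distribution both the PDF and the CDF can be evaluated with the standard library, which also gives a quick numerical check of the relation $f(x) = F_X'(x)$. The sketch relies on the known identity $F(x) = \frac{1}{2}\left(1 + \operatorname{erf}\left(\frac{x - \mu}{\sigma\sqrt{2}}\right)\right)$; the function names are assumptions of this example.

```python
from math import erf, exp, pi, sqrt


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of Gaussian(mu, sigma^2) at the point x."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))


def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of Gaussian(mu, sigma^2), written via the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))


# P(-1.96 < X < 1.96) for the standard normal distribution.
p_inside = normal_cdf(1.96) - normal_cdf(-1.96)
print(round(p_inside, 3))  # 0.95

# Central difference of the CDF approximates the PDF.
h = 1e-5
slope = (normal_cdf(0.5 + h) - normal_cdf(0.5 - h)) / (2 * h)
print(abs(slope - normal_pdf(0.5)) < 1e-6)  # True
```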



<h3 align="center">Expected value</h3>

$\textbf{Definition}$. Let $(\Omega, \Sigma, P)$ be a probability space. The **expectation** of a random variable $X$ is defined as:
$$\operatorname{E} [X]  = \int_\Omega X(\omega)\,d\operatorname{P}(\omega).$$

For discrete and continuous random variables, with $f$ the probability mass function or the probability density function respectively, it takes the following form:
$$
\operatorname{E}[X] = 
\left\{\begin{matrix}
\sum_x xf(x), & \text{ if } X \text { is discrete}; \\
\int_{\mathbb{R}} x f(x)\, dx, & \text{ if } X \text { is continuous};
\end{matrix}\right.
.$$

In many cases **expectation** is denoted by $\mu$.

The expected value has the following properties:
<br> &emsp; $\bullet$ $\operatorname{E}[X + Y] = \operatorname{E}[X] + \operatorname{E}[Y];$
<br> &emsp; $\bullet$ $\operatorname{E}[aX]    = a \operatorname{E}[X];$
<br> &emsp; $\bullet$ If $X_1, X_2, \dots, X_n$ are independent random variables, then:
$$\operatorname{E} \left [\prod_{i=1}^nX_i \right ] = \prod_{i=1}^n \operatorname{E} \left [X_i \right ].$$
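
These properties can be checked on a small discrete example. The sketch below assumes two independent fair dice $X$ and $Y$, and uses the standard fact that $\operatorname{E}[g(X)] = \sum_x g(x) f(x)$ for a discrete random variable; the names `expectation`, `die` and `joint` are chosen for the illustration.

```python
from fractions import Fraction
from itertools import product

# PMF of a single fair die.
die = {x: Fraction(1, 6) for x in range(1, 7)}


def expectation(pmf, g=lambda x: x):
    """E[g(X)] = sum over x of g(x) * f(x) for a discrete variable."""
    return sum(g(x) * p for x, p in pmf.items())


# Joint PMF of two independent dice: probabilities multiply.
joint = {
    (x, y): px * py
    for (x, px), (y, py) in product(die.items(), repeat=2)
}

e_x = expectation(die)
e_sum = expectation(joint, lambda w: w[0] + w[1])
e_prod = expectation(joint, lambda w: w[0] * w[1])

print(e_x)     # 7/2
print(e_sum)   # 7     = E[X] + E[Y]
print(e_prod)  # 49/4  = E[X] * E[Y], since X and Y are independent
```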


<h3 align="center">Variance</h3>


$\textbf{Definition}$. The **variance** of a random variable $X$ is the expected value of the squared deviation from the mean of $X$:

$$\begin{align}
\operatorname{Var}(X) &= \operatorname{E}\left[(X - \operatorname{E}[X])^2\right] \\[4pt]
&= \operatorname{E}\left[X^2 - 2X\operatorname{E}[X] + \operatorname{E}[X]^2\right] \\[4pt]
&= \operatorname{E}\left[X^2\right] - 2\operatorname{E}[X]\operatorname{E}[X] + \operatorname{E}[X]^2 \\[4pt]
&= \operatorname{E}\left[X^2 \right] - \operatorname{E}[X]^2.
\end{align}$$

In many cases variance is denoted by $\sigma^2$.

Variance has the following properties:
<br> &emsp; $\bullet$ If $X$ and $Y$ are independent, then $\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y);$
<br> &emsp; $\bullet$ $\operatorname{Var}(\alpha \cdot X) = \alpha^2\operatorname{Var}(X).$
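
Both forms of the variance derived above, as well as the scaling property, can be verified on a fair die. This is an illustrative sketch; the function names are assumptions of the example.

```python
from fractions import Fraction

# PMF of a fair die.
die = {x: Fraction(1, 6) for x in range(1, 7)}


def mean(pmf):
    """E[X] for a discrete random variable."""
    return sum(x * p for x, p in pmf.items())


def variance(pmf):
    """Var(X) = E[(X - E[X])^2], the definition."""
    mu = mean(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())


def variance_shortcut(pmf):
    """Var(X) = E[X^2] - E[X]^2, the last line of the derivation."""
    second_moment = sum(x**2 * p for x, p in pmf.items())
    return second_moment - mean(pmf) ** 2


# PMF of 3X: the same probabilities attached to tripled values.
tripled = {3 * x: p for x, p in die.items()}

print(variance(die), variance_shortcut(die))  # 35/12 35/12
print(variance(tripled))                      # 105/4 = 9 * 35/12
```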

<h3 align="center">Covariance</h3>


$\textbf{Definition}$. Let $X$ and $Y$ be random variables. The **covariance** is defined as the expected value of the product of their deviations from their individual expected values:

$$\begin{align}
\operatorname{Cov}(X, Y) &= \operatorname{E}\left[(X - \operatorname{E}[X]) (Y - \operatorname{E}[Y])\right] \\[4pt]
&= \operatorname{E}\left[XY - X\operatorname{E}[Y] - Y\operatorname{E}[X] + \operatorname{E}[X]\operatorname{E}[Y] \right] \\[4pt]
&= \operatorname{E}\left[XY\right] - \operatorname{E}[X]\operatorname{E}[Y] - \operatorname{E}[X]\operatorname{E}[Y] + \operatorname{E}[X]\operatorname{E}[Y] \\[4pt]
&= \operatorname{E}\left[XY \right] - \operatorname{E}[X]\operatorname{E}[Y].
\end{align}$$

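Covariance satisfies the following bound (the Cauchy-Schwarz inequality):
$$|\operatorname{Cov}(X, Y)| \leq \sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}.$$
Consequently, the **correlation coefficient** $\rho_{X,Y} = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}}$ satisfies $-1 \leq \rho_{X,Y} \leq 1$.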

<h3 align="center">Covariance Matrix</h3>

$\textbf{Definition}$. Let $X = (X_1, X_2, \dots , X_n)$ be a random vector whose components have finite variance and expected value. 
The covariance matrix $\operatorname{K}_{XX}$ is the $n \times n$ matrix whose $(i,j)$ entry is:

$$\operatorname{K}_{X_i X_j} = \operatorname{cov}[X_i, X_j] = \operatorname{E}[(X_i - \operatorname{E}[X_i])(X_j - \operatorname{E}[X_j])].$$

In other words:

$$
\operatorname{K}_{\mathbf{X}\mathbf{X}}=\begin{bmatrix}
 \mathrm{E}[(X_1 - \operatorname{E}[X_1])(X_1 - \operatorname{E}[X_1])] & \mathrm{E}[(X_1 - \operatorname{E}[X_1])(X_2 - \operatorname{E}[X_2])] & \cdots & \mathrm{E}[(X_1 - \operatorname{E}[X_1])(X_n - \operatorname{E}[X_n])] \\ \\
 \mathrm{E}[(X_2 - \operatorname{E}[X_2])(X_1 - \operatorname{E}[X_1])] & \mathrm{E}[(X_2 - \operatorname{E}[X_2])(X_2 - \operatorname{E}[X_2])] & \cdots & \mathrm{E}[(X_2 - \operatorname{E}[X_2])(X_n - \operatorname{E}[X_n])] \\ \\
 \vdots & \vdots & \ddots & \vdots \\ \\
 \mathrm{E}[(X_n - \operatorname{E}[X_n])(X_1 - \operatorname{E}[X_1])] & \mathrm{E}[(X_n - \operatorname{E}[X_n])(X_2 - \operatorname{E}[X_2])] & \cdots & \mathrm{E}[(X_n - \operatorname{E}[X_n])(X_n - \operatorname{E}[X_n])]
\end{bmatrix}
.$$

<h3 align="center">Sample Covariance</h3>

- **Sample covariance** measures how two samples of $n$ paired observations $X = (X_1, X_2, \cdots , X_n)$ and $Y = (Y_1, Y_2, \cdots , Y_n)$ vary together around their sample means $\bar{X}$ and $\bar{Y}$:
$$Cov(X,Y) = \frac{1}{n-1} \sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}).$$
- Covariance can be understood as the **variability due to codependence**, whereas the variance is the **independent variability**.

- **Positive covariance**: the two variables tend to be both above or both below their respective means. Variables with a positive covariance are **positively correlated** - they go up or down together. 

- **Negative covariance**: one variable tends to be above its mean when the other is below its mean. In other words, **negative covariance** means that if one variable goes up, the other variable tends to go down.

- The dimension of the covariance is the product of the units of $X$ and $Y$; when both are measured in the same unit it is ${unit}^2$, as for the variance. 
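
The sample covariance formula translates directly into code. The data below are made up to show one positively and one negatively related pair of samples; the function name `sample_covariance` is an assumption of this sketch.

```python
def sample_covariance(xs, ys):
    """Sample covariance with the 1 / (n - 1) normalisation."""
    if len(xs) != len(ys):
        raise ValueError("samples must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("at least two observations are required")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)
    ) / (n - 1)


xs = [1, 2, 3, 4, 5]
ys_up = [2, 4, 6, 8, 10]    # grows together with xs
ys_down = [10, 8, 6, 4, 2]  # falls as xs grows

print(sample_covariance(xs, ys_up))    # 5.0   (positive covariance)
print(sample_covariance(xs, ys_down))  # -5.0  (negative covariance)
print(sample_covariance(xs, xs))       # 2.5   (the sample variance of xs)
```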



<h3 align="center">Convergence of random variables</h3>

$\textbf{Definition}$. A sequence $X_1, X_2, X_3, \dots$ of random variables is said to **converge weakly** or **converge in distribution** to a random variable $X$ if:

$$\lim_{n \to \infty} F_n(x)=F(x),$$

for every number $x \in \mathbb {R}$ at which $F$ is continuous. 
<br>Here $F_n$ and $F$ are the cumulative distribution functions of random variables $X_n$ and $X$, respectively.


$\textbf{Definition}$. A sequence $X_1, X_2, X_3, \dots$ of random variables is said to **converge strongly** or **converge almost surely** to a random variable $X$ if:

$$P\left( \lim_{n \to \infty} X_n = X \right)=1.$$


<h3 align="center">Law of large numbers</h3>

$\textbf{Concept}$. The law of large numbers states that as a sample size grows, its mean gets closer to the average of the whole population. 

Let  $X_1, X_2, \cdots$ be **independent and identically distributed** random variables with expected value $\mu$, and let $\overline{X}_n = \frac{1}{n}\sum_{i=1}^{n}X_i$ be the sample average. Then:

$\textbf{Theorem} \space \textbf{21}$. The **weak law of large numbers**, or **Khinchin's law**, states that the sample average **converges in probability** towards the expected value:
$$\overline{X}_n \xrightarrow {P} \mu \text{ as } n \to \infty,$$
i.e. for any positive number $\epsilon$:
$$\lim_{n \to \infty} P(|\overline{X}_n - \mu| > \epsilon) = 0.$$
$\textbf{Theorem} \space \textbf{22}$. The **strong law of large numbers** states that the sample average converges **almost surely** to the expected value:
$$ P \left( \lim _{n\to \infty } \overline{X}_n = \mu \right)=1.$$
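
The law of large numbers can be observed in a simulation. The sketch below rolls a fair die (expected value $\mu = 3.5$) and tracks the running sample average $\overline{X}_n$; the seed, the sample size and the function name `running_means` are arbitrary choices made for this illustration.

```python
import random


def running_means(samples):
    """Return the list of sample averages after 1, 2, ..., n draws."""
    total = 0.0
    means = []
    for n, x in enumerate(samples, start=1):
        total += x
        means.append(total / n)
    return means


rng = random.Random(0)  # fixed seed so that the run is reproducible
rolls = [rng.randint(1, 6) for _ in range(100_000)]
means = running_means(rolls)

# The running average settles near the expected value 3.5.
for n in (10, 1_000, 100_000):
    print(n, means[n - 1])
```

A single run illustrates the statement but does not prove it: the theorems describe the behaviour as $n \to \infty$.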


<h3 align="center">Central Limit Theorem</h3>

$\textbf{Concept}$. Given certain conditions, the arithmetic mean of a **sufficiently large number** of iterates of **independent random variables**, each with a well-defined expected value and well-defined variance, will be approximately **normally distributed**, regardless of the underlying distribution.

Let $X_{1}, X_{2}, \dots$ be a sequence of independent and identically distributed random variables with the mean $\operatorname{E}[X_{i}]=\mu$ and finite, non-zero variance $\operatorname{Var}(X_{i})=\sigma^2<\infty$. 

$\textbf{Theorem} \space \textbf{23}$. As $n$ approaches infinity, the random variables $\frac{\sqrt{n}}{\sigma} \left( S_n - \mu \right)$ converge in distribution to a normal $N(0, 1)$:
$$\frac{\sqrt{n}}{\sigma}\left( S_n - \mu\right) \xrightarrow{d} N(0, 1) \text{ as } n \to \infty,$$

where $S_n$ is the sample mean for a chosen $n$:
 $$S_n = \frac{1}{n}\sum_{i=1}^{n} X_i.$$
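
A simulation gives a rough feel for the theorem. The sketch below assumes $X_i \sim \operatorname{Uniform}(0, 1)$, for which $\mu = 1/2$ and $\sigma^2 = 1/12$, and checks that about $95\%$ of the standardized sample means fall inside $(-1.96, 1.96)$, as they would for $N(0, 1)$. The sample size, the number of repetitions, the seed and the function name are arbitrary illustrative choices.

```python
import random
from math import sqrt


def standardized_mean(rng, n):
    """Return sqrt(n) / sigma * (S_n - mu) for n Uniform(0, 1) draws."""
    mu, sigma = 0.5, sqrt(1 / 12)
    s_n = sum(rng.random() for _ in range(n)) / n
    return sqrt(n) / sigma * (s_n - mu)


rng = random.Random(0)  # fixed seed so that the run is reproducible
z = [standardized_mean(rng, 50) for _ in range(2000)]

# For N(0, 1) about 95% of the mass lies within (-1.96, 1.96).
inside = sum(1 for v in z if abs(v) < 1.96) / len(z)
print(inside)
```

Even though the underlying distribution is flat rather than bell-shaped, the standardized means behave approximately like a standard normal variable.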