# 5. Inequalities and Limit Theorems 

1. ~~Convexity.~~
2. ~~Gibbs’, Markov’s, and Chebyshev’s inequalities.~~
3. Modes of convergence. 
4. ~~Weak and strong laws of large numbers.~~
5. ~~Central limit theorem.~~
6. ~~Kullback–Leibler divergence, cross-entropy, and mutual information.~~


## `1. Convexity`:

### **`Def. 1` Convex Set**:
A set $S \subseteq \mathbb{R}^n$ is **convex** if for any $x, y \in S$ and any $\theta \in [0, 1]$, the point 
$$
\theta x + (1 - \theta) y \in S
$$

> * **Meaning**: For any two points in the set, the line segment connecting them lies entirely within the set.


### **`Def. 2` Convex Function**:
A function $f: \mathbb{R}^n \to \mathbb{R}$ is **convex** if its domain $Dom(f)$ is a convex set and, for all $x, y \in Dom(f)$ and all $\theta \in [0, 1]$, the following inequality holds:
$$
f(\theta x + (1 - \theta) y) \leq \theta f(x) + (1 - \theta) f(y)
$$

> * **Meaning**: The line segment connecting any two points on the graph of the function lies above or on the graph itself.

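As a rough numerical illustration of `Def. 2`, the sketch below searches for a violation of the convexity inequality at random points. The helper name `violates_convexity`, the test intervals, and the tolerance are assumptions made for this example; finding no violation is evidence, not a proof, of convexity.

```python
import math
import random

def violates_convexity(f, lo, hi, trials=10_000, seed=0, tol=1e-9):
    """Search [lo, hi] for points where f(theta*x + (1-theta)*y) exceeds the chord."""
    rng = random.Random(seed)
    for _ in range(trials):
        x, y = rng.uniform(lo, hi), rng.uniform(lo, hi)
        theta = rng.random()
        lhs = f(theta * x + (1 - theta) * y)
        rhs = theta * f(x) + (1 - theta) * f(y)
        if lhs > rhs + tol:
            return True    # the graph rises above the chord: not convex
    return False           # no counterexample found (not a proof)

# x^2 is convex on the real line; sin is concave on [0, pi].
square_violates = violates_convexity(lambda t: t * t, -5.0, 5.0)
sine_violates = violates_convexity(math.sin, 0.0, math.pi)
print(square_violates, sine_violates)
```

The tolerance only absorbs floating-point rounding; it is not part of the definition.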

## `3. Modes of convergence` *(supplement — not present in this PDF)*

### Motivation
In probability we have several different notions of “$X_n$ approaches $X$”.
They are NOT equivalent, so it is important to know both the definitions and the implication arrows between them.



### `Def. (standard) Almost sure convergence`
$$
X_n \to X \ \text{a.s.}
\quad \Longleftrightarrow \quad
P\left(\{\omega:\lim_{n\to\infty}X_n(\omega)=X(\omega)\}\right)=1.
$$



### `Def. (standard) Convergence in probability`
$$
X_n \xrightarrow{P} X
\quad \Longleftrightarrow \quad
\forall \varepsilon>0:\ \ P(|X_n-X|>\varepsilon)\to 0.
$$



### `Def. (standard) $L^p$ convergence`
For $p\ge 1$:
$$
X_n \xrightarrow{L^p} X
\quad \Longleftrightarrow \quad
\mathbb E\big[|X_n-X|^p\big]\to 0.
$$


### `Def. (standard) Convergence in distribution`
$$
X_n \xrightarrow{d} X
\quad \Longleftrightarrow \quad
F_{X_n}(x)\to F_X(x) \ \ \text{for all continuity points } x \text{ of } F_X.
$$



### `Standard implication chain (important to memorise)`
$$
X_n \xrightarrow{a.s.} X
\ \Rightarrow\
X_n \xrightarrow{P} X
\ \Rightarrow\
X_n \xrightarrow{d} X.
$$

Also:
$$
X_n \xrightarrow{L^p} X \ \Rightarrow\ X_n \xrightarrow{P} X.
$$
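The reverse arrows fail in general. As a worked illustration (a standard textbook counterexample, not taken from these notes), take $U \sim \text{Uniform}(0,1)$ and set $X_n = n\,\mathbf{1}\{U \le 1/n\}$. For any $\varepsilon > 0$:
$$
P(|X_n - 0| > \varepsilon) \le P(U \le 1/n) = \frac{1}{n} \to 0,
\qquad \text{but} \qquad
\mathbb E\big[|X_n - 0|\big] = n \cdot \frac{1}{n} = 1 \not\to 0.
$$
So $X_n \xrightarrow{P} 0$, while $X_n$ does not converge to $0$ in $L^1$.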

# `6. Kullback–Leibler Divergence, cross-entropy, and mutual information`


## `Kullback–Leibler Divergence:`


Entropy works fine when measuring uncertainty within a single distribution, but what if we want to measure the difference between two distributions?
If we have two distributions $P$ and $Q$ over the same set of events, how can we quantify how different they are?
The natural way is to compare their ratio:
$$
\text{Relative likelihood} = \frac{P(x)}{Q(x)}
$$

This ratio is the natural object because probability measures themselves are scale-free and live in an abstract measure space. The Radon-Nikodym derivative $\frac{dP}{dQ}$ simply tells us how to re-scale one measure to get the other.

Holding onto entropy as a measure of surprise, what exactly is the surprise we experience if reality behaves like $P$ but we expect it to behave like $Q$?

We define an average surprise under $P$:
$$
E_P\left[-\log Q(x)\right] = -\sum_x P(x)\log Q(x)
$$

And to find out how much of the surprise we owe to using the wrong model, we take the difference between this and the entropy of $P$:
$$
E_P\left[-\log Q(x)\right] - E_P\left[-\log P(x)\right] = -\sum_x P(x)\log Q(x) + \sum_x P(x)\log P(x) = \sum_x P(x)\log\frac{P(x)}{Q(x)}
$$
$$
E_{P}\left[\log\left(\frac{dP}{dQ}\right)\right] = \sum_x P(x)\log\frac{P(x)}{Q(x)}
$$

This is the **Kullback-Leibler divergence**.



### **`Def. 10.` KL Divergence**:

Let $(\Omega, \mathcal{F})$ be a measurable space and let $P$ and $Q$ be two probability measures on this space such that $P$ is absolutely continuous with respect to $Q$ (denoted as $P \ll Q$). The Kullback-Leibler (KL) divergence from $Q$ to $P$ is defined as:
$$
D_{KL}(P \| Q) = \int_{\Omega} \log\left(\frac{dP}{dQ}\right) dP = E_{P}\left[\log\left(\frac{dP}{dQ}\right)\right]
$$
where $\frac{dP}{dQ}$ is the Radon-Nikodym derivative of $P$ with respect to $Q$.

> * **Interpretation**: KL divergence measures how one probability distribution diverges from a second, expected probability distribution. It is not symmetric, meaning that generally $D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$.

> * **Note**: If $P \not\ll Q$, we set $D_{KL}(P \| Q) = +\infty$.

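For discrete distributions the definition reduces to the sum $\sum_x P(x)\log\frac{P(x)}{Q(x)}$. The following is a minimal sketch under that assumption; the function name `kl_divergence` and the two example distributions are hypothetical, and the result is in nats because the natural logarithm is used.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in nats for discrete distributions given as aligned lists."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue             # convention: 0 * log(0 / q) = 0
        if qi == 0.0:
            return math.inf      # P is not absolutely continuous w.r.t. Q
        total += pi * math.log(pi / qi)
    return total

p = [0.5, 0.5]   # hypothetical "true" distribution
q = [0.9, 0.1]   # hypothetical model

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
print(kl_pq, kl_qp)
```

The two directions give different values, which illustrates the asymmetry noted in the interpretation above.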
### `Properties`:

1. **Non-negativity**: $D_{KL}(P \| Q) \geq 0$, with equality if and only if $P = Q$ almost everywhere.
2. **Chain Rule**: if $P_{X,Y}$ and $Q_{X,Y}$ are joint distributions, then
$$
D_{KL}(P_{X,Y} \| Q_{X,Y}) = D_{KL}(P_X \| Q_X) + \mathbb{E}_{P_X}[D_{KL}(P_{Y|X} \| Q_{Y|X})]
$$
3. **Invariance under reparameterization**: KL divergence remains unchanged under smooth and invertible transformations of the random variables.


### `Properties that MUST be memorized`:

1. Non-negativity;
2. Additivity over independent samples;
3. Locality (depends only on density ratios $\frac{dP}{dQ}$).
4. Continuity.
5. Invariance under reparameterization.
6. Compatibility with Bayesian Inference.
7. Coding-length interpretation (expected extra code length when coding with $Q$ instead of $P$).



## `Cross-Entropy and its relation to KL Divergence`:

### **Related Concepts**:

1. Entropy:
$$
H(P) := -\int p(x) \log{p(x)} \, dx
$$

2. Cross-Entropy:
$$
H(P, Q) := -\int p(x) \log{q(x)} \, dx
$$

3. KL Divergence:
$$
D(P\|Q) := \int p(x) \log{\frac{p(x)}{q(x)}} \, dx = H(P, Q) - H(P)
$$

> **Note**: Cross-Entropy = Entropy (uncertainty of the world) + KL divergence (penalty for the wrong model), i.e. $H(P, Q) = H(P) + D(P\|Q)$.

* $H(P)$ measures the uncertainty inherent in the data.
* $D(P\|Q)$ measures the extra cost of encoding samples using the wrong distribution.

The **Cross-Entropy** gives the expected number of bits (with base-2 logarithms) needed to encode a sample if the world follows $P$ but we assumed $Q$.


| **Quantity**       | **Formula**                              | **Interpretation**                                      |
|--------------------|---------------------------------------------|---------------------------------------------------------|
| Entropy            | $H(P) = -\int p(x) \log p(x) dx$ | Measures the uncertainty inherent in the distribution $P$. |
| Cross-Entropy      | $H(P, Q) = -\int p(x) \log q(x) dx$ | Measures the expected coding cost when data from $P$ are encoded using a model $Q$. |
| Kullback-Leibler Divergence | $D(P \| Q) = \int p(x) \log \frac{p(x)}{q(x)} dx$ | Measures the extra cost due to the mismatch between $P$ and $Q$. |

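The relation between the three quantities can be checked numerically in the discrete case, where the integrals become sums. This is a sketch with hypothetical distributions `p` and `q`; base-2 logarithms are used so that all values are in bits.

```python
import math

def entropy(p):
    """H(P) in bits for a discrete distribution."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_bits(p, q):
    """D(P || Q) in bits; same support assumption as cross_entropy."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]    # hypothetical "true" distribution
q = [0.25, 0.25, 0.5]    # hypothetical model

h_p = entropy(p)              # 1.5 bits
h_pq = cross_entropy(p, q)    # 1.75 bits
d_pq = kl_bits(p, q)          # 0.25 bits
print(h_p, h_pq, d_pq)
```

Here the cross-entropy exceeds the entropy by exactly the KL divergence: coding with the wrong model costs a quarter of a bit per symbol on average.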

## `Mutual Information:`

Picture the world as a pile of tiny possible outcomes. A probability model is just a rule that tells us how to weigh each tiny outcome.
* Model A (True) might put most of the weight in a small region.
* Model B (Our guess) might spread the weight differently.

Comparing them boils down to finding local ratios:
- At each possible outcome, how much more does A weigh it compared to B?
This ratio is the most primitive geometric fact about measures.


Now, say we observe a pair of things $(x, y)$. The pointwise information is simply the local surprise:
"How much more likely is this pair under the real joint behavior than it would be if the two were independent?"

- **Positive Value**: That pair happens more often together than if they were independent.
- **Negative Value**: That pair happens less often together than if they were independent.
- **Zero**: That pair happens just as often together as if they were independent.

In other words, this is what we call **Pointwise Mutual Information (PMI)**, and it can be treated as the local curvature.



### **`Def. 11.1` Pointwise Mutual Information (PMI)**:

Let $(X, Y)$ be a pair of random variables with joint probability distribution $P_{X,Y}$ and marginal distributions $P_X$ and $P_Y$, and suppose $P_{X,Y} \ll P_X \otimes P_Y$. The Pointwise Mutual Information (PMI) between $X$ and $Y$ at the point $(x, y)$ is defined as:
$$
\text{pmi}(x, y) := \log\left(\frac{dP_{X,Y}}{d(P_X \otimes P_Y)}(x, y)\right)
$$

Whenever densities exist this reduces to:
$$
\text{pmi}(x, y) = \log\left(\frac{p(x, y)}{p(x) \, p(y)}\right)
$$

> * **Interpretation**: PMI shows how surprising (or not) it is to observe the pair $(x, y)$ together, compared to what we would expect if $X$ and $Y$ were independent.


### **`Def. 11.2` Mutual Information (MI)**:

Given a pair of random variables $(X, Y)$, we take the expected value of their Pointwise Mutual Information (PMI) under the joint distribution:
$$
\text{MI}(X, Y) := E_{P_{XY}}[\text{pmi}(X, Y)]
$$

This is called the **Mutual Information (MI)** between $X$ and $Y$.
Equivalently, this can be stated as:
$$
\text{MI}(X, Y) = D_{KL}(P_{XY} \| P_X \otimes P_Y) = \int \log\left(\frac{dP_{XY}}{d(P_X \otimes P_Y)}\right) dP_{XY}
$$

> * **Geometric Meaning**:
>> 1. The joint $P_{XY}$ is a point in the probability measure space.
>> 2. The product of marginals $P_X \otimes P_Y$ is "the closest independent distribution".
>> 3. The MI is the information-geometric distance between these two points, measured by KL divergence.

So, MI tells how much non-independence there is between $X$ and $Y$.


#### `Average Evidence`:
- If the average is **ZERO**: On average, seeing $X$ gives no information about $Y$ (independent).
- If the average is **LARGE**: On average, samples tend to carry evidence that the joint is not the product of marginals (dependent).


### `Properties of Mutual Information`:

1. Non-negativity: $\text{MI}(X, Y) \geq 0$, with equality if and only if $X$ and $Y$ are independent.
2. Symmetry: $\text{MI}(X, Y) = \text{MI}(Y, X)$.
3. Zero iff Independence: $\text{MI}(X, Y) = 0 \iff P_{X,Y} = P_X \otimes P_Y$ ($X$ and $Y$ are independent).
4. Chain Rule: $\text{MI}(X, (Y, Z)) = \text{MI}(X, Y) + \text{MI}(X, Z | Y)$, where the left-hand side is the information between $X$ and the pair $(Y, Z)$.
5. Relation to Entropy: $\text{MI}(X, Y) = H(X) + H(Y) - H(X, Y) = H(X) - H(X | Y) = H(Y) - H(Y | X)$.
6. Data Processing Inequality: If $X \to Y \to Z$ forms a Markov chain, then $\text{MI}(X, Z) \leq \text{MI}(X, Y)$.

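For a finite joint table, MI is the $P_{XY}$-weighted average of the PMI values, and properties 1, 2 and 5 can be checked directly. The sketch below assumes a discrete joint distribution stored as a nested list; the function names and the two example tables are hypothetical.

```python
import math

def mutual_information(joint):
    """MI in nats from a table with joint[i][j] = P(X = i, Y = j)."""
    px = [sum(row) for row in joint]            # marginal of X
    py = [sum(col) for col in zip(*joint)]      # marginal of Y
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                # P(x, y) * pmi(x, y)
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

independent = [[0.25, 0.25], [0.25, 0.25]]   # joint equals product of marginals
dependent = [[0.4, 0.1], [0.1, 0.4]]         # X and Y tend to agree

mi_indep = mutual_information(independent)
mi_dep = mutual_information(dependent)

# Property 5: MI = H(X) + H(Y) - H(X, Y)
flat = [p for row in dependent for p in row]
mi_via_entropy = entropy([0.5, 0.5]) + entropy([0.5, 0.5]) - entropy(flat)

# Property 2: swapping the roles of X and Y (transposing) leaves MI unchanged
mi_swapped = mutual_information([list(col) for col in zip(*dependent)])
print(mi_indep, mi_dep, mi_via_entropy, mi_swapped)
```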
# `2. Gibbs’, Markov’s, and Chebyshev’s inequalities.`


## **`Theorem 10.1` Gibbs’ Inequality**:

Given a measurable space $(\Omega, \mathcal{F})$ and two probability measures $P$ and $Q$ on this space such that $P$ is absolutely continuous with respect to $Q$ (denoted as $P \ll Q$), the Kullback-Leibler divergence from $Q$ to $P$ satisfies:
$$
D_{KL}(P \| Q) \geq 0
$$
with equality if and only if $P = Q$ (that is, the measures coincide).

> * **Interpretation**: This inequality states that the KL divergence is always non-negative, and it is zero if and only if the two probability measures are identical almost everywhere.


## **`Theorem 10.2` Markov’s Inequality**:

Given a random variable $X \geq 0$ and a positive constant $a > 0$, the following inequality holds:
$$
P(X \geq a) \leq \frac{E[X]}{a}
$$

> * **Interpretation**: This inequality provides an upper bound on the probability that a non-negative random variable exceeds a certain value, based on its expected value.

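A small exact check can make the bound concrete. The sketch below uses a fair six-sided die as a hypothetical example of a non-negative random variable and compares the exact tail probability with the Markov bound for every threshold $a = 1, \ldots, 6$; `Fraction` keeps the arithmetic exact.

```python
from fractions import Fraction

# Fair six-sided die: X takes values 1..6, each with probability 1/6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}
mean = sum(x * p for x, p in pmf.items())    # E[X] = 7/2

def tail_probability(a):
    """Exact P(X >= a) for the die."""
    return sum(p for x, p in pmf.items() if x >= a)

# For each threshold a: (exact tail probability, Markov bound E[X] / a).
rows = {a: (tail_probability(a), mean / a) for a in range(1, 7)}
markov_holds = all(tail <= bound for tail, bound in rows.values())

for a, (tail, bound) in rows.items():
    print(a, tail, bound)
```

The bound is loose here (for small $a$ it even exceeds 1), which is typical: Markov's inequality uses only the mean.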
## **`Theorem 10.3` Chebyshev’s Inequality**:

Given a random variable $X$ with finite expected value $\mu = E[X]$ and finite variance $\sigma^2 = Var(X)$, for any constant $k > 0$ the following inequality holds (it is Markov's inequality applied to the non-negative random variable $(X - \mu)^2$ with threshold $k^2$):
$$
P(|X - \mu| \geq k) \leq \frac{\sigma^2}{k^2}
$$

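> * **Interpretation**: This inequality provides an upper bound on the probability that a random variable deviates from its mean by at least $k$, based on its variance. Taking $k = c\sigma$ gives the familiar form $P(|X - \mu| \geq c\sigma) \leq \frac{1}{c^2}$, a bound on deviations of at least $c$ standard deviations.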

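The bound follows from Markov's inequality in one step. As a sketch, the events $\{|X - \mu| \geq k\}$ and $\{(X - \mu)^2 \geq k^2\}$ coincide, so applying Markov to $(X - \mu)^2$ with $a = k^2$ gives:
$$
P(|X - \mu| \geq k) = P\big((X - \mu)^2 \geq k^2\big) \leq \frac{E[(X - \mu)^2]}{k^2} = \frac{\sigma^2}{k^2}
$$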
## `Context`:

### **`Def.` Independent and Identically Distributed (i.i.d.) Random Variables**:

A sequence of random variables $\{X_1, X_2, \ldots, X_n\}$ is ***i.i.d.*** (independent and identically distributed) if the variables are mutually independent, so that no data point carries information about the others, and all of them are drawn from the same probability distribution.



### **`Def.` Mean**:

Mean is a measure of central tendency that represents the "average" or "center" value of an entire set of numbers or of a probability distribution. For a dataset, it is calculated by summing all the values and then dividing by the total number of values:
$$
\mu = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{x_1 + \cdots + x_n}{n} = \overline{X}, \quad \text{ where } x_i \in X \quad \forall i=1,\ldots,n
$$



In the context of probability distributions, $\mu$ is synonymous with the expected value $E[X]$ of a random variable $X$.

* **Discrete Random Variable**:
  $$
  E[X] = \sum_{i} x_i \, P(X = x_i) = \sum_{i} x_i \, p_X(x_i)
  $$

* **Continuous Random Variable**:
  $$
  E[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx
  $$
  where $f(x)$ is the probability density function (PDF) of the random variable $X$.
  

### **`Def.` Variance**:

Variance is a measure of dispersion that quantifies how far a set of numbers is spread out from their average (mean) value.

1. **Low Variance**: Data points are clustered closely around the mean, indicating consistency.
2. **High Variance**: Data points are spread out over a wider range, indicating greater variability.
3. **Zero Variance**: All data points are identical, indicating no variability.

The mathematical formulation depends on whether we are dealing with a population or a sample:

| **Type**       | **Symbol**  | **Formula**   | **Explanation**                              |
|----------------|-------------|---------------|----------------------------------------------|
| Population     | $\sigma^2$  | $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$ | Variance of the entire population. Divided by the total number of observations $N$. |
| Sample         | $s^2$        | $s^2 = \frac{\sum_{i=1}^{n} (x_i - \overline{x})^2}{n - 1}$ | Variance of a sample from the population. Divided by $n-1$ (**Bessel's correction**) to provide an unbiased estimate of the population variance. |

<br>

> **Note**: $\mu$ is the population mean, and $\overline{x}$ is the sample mean.

For a **Random Variable** $X$, the variance is defined as the **expected value of the squared deviation from the mean** $\mu$:
$$
Var(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2
$$
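The population and sample formulas, and the identity $Var(X) = E[X^2] - (E[X])^2$, can be illustrated with Python's standard library. The dataset below is hypothetical; `statistics.pvariance` divides by $N$ and `statistics.variance` divides by $n - 1$.

```python
import statistics
from fractions import Fraction

data = [2, 4, 4, 4, 5, 5, 7, 9]            # hypothetical observations, mean 5

pop_var = statistics.pvariance(data)       # sum of squared deviations / N
sample_var = statistics.variance(data)     # sum of squared deviations / (n - 1)

# Random-variable version for a fair die: Var(X) = E[X^2] - (E[X])^2
p = Fraction(1, 6)
e_x = sum(x * p for x in range(1, 7))          # E[X]   = 7/2
e_x2 = sum(x * x * p for x in range(1, 7))     # E[X^2] = 91/6
die_var = e_x2 - e_x ** 2                      # 35/12

print(pop_var, sample_var, die_var)
```

The squared deviations of `data` sum to 32, so the two estimates are $32/8 = 4$ and $32/7 \approx 4.57$; the sample version is always the larger of the two.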

**`Properties of Variance`**:

1. **Non-Negativity**: Variance is always non-negative, i.e., $Var(X) \geq 0$.
2. **Squared Units**: Variance is expressed in squared units of the original data (e.g., if data is in meters, variance is in square meters).
3. **Relationship to Standard Deviation**: The standard deviation ($\sigma$) is the square root of the variance, providing a measure of spread in the same units as the data.
4. **Sensitivity to Outliers**: Variance is sensitive to extreme values (outliers) because it squares the deviations from the mean.



## `4. Weak and strong laws of large numbers`


### **`Def.` Law of Large Numbers**:

Assume we have a sequence of independent and identically distributed (i.i.d.) random variables $\{X_1, X_2, \ldots, X_n\}$, with finite mean $\mu$ and a finite variance $\sigma^2$.
Let $\overline{X}_n = \frac{X_1 + \ldots + X_n}{n}$ be the sample mean (of $X_1$ through $X_n$). It is itself a random variable with mean $\mu$:
$$
E[\overline{X}_n] = \frac{E(X_1 + X_2 + \ldots + X_n)}{n} = \frac{E(X_1) + \ldots + E(X_n)}{n} = \mu
$$

and, because the $X_i$ are independent, variance $\frac{\sigma^2}{n}$:
$$
Var(\overline{X}_n) = Var\left(\frac{X_1 + X_2 + \ldots + X_n}{n}\right) = \frac{Var(X_1) + \ldots + Var(X_n)}{n^2} = \frac{n \sigma^2}{n^2} = \frac{\sigma^2}{n}
$$

> * **Note**: The ***Law of Large Numbers (LLN)*** states that as $n$ grows larger, the sample mean $\overline{X}_n$ converges to the true mean $\mu$. **LLN** has two main forms: the **Weak Law of Large Numbers (WLLN)** and the **Strong Law of Large Numbers (SLLN)**.



### **`Theorem 10.2.1` Strong Law of Large Numbers (SLLN)**:

The sample mean $\overline{X}_n$ converges to the true mean $\mu$ pointwise, with probability 1, as the number of observations $n$ approaches infinity:
$$
P\left(\lim_{n \to \infty} \overline{X}_n = \mu\right) = 1
$$

> * Recall that random variables are functions from the sample space to the reals, $X: \Omega \to \mathbb{R}$. **Pointwise convergence** with probability 1 says that $\overline{X}_n(\omega) \to \mu$ for each point $\omega \in \Omega$, except possibly on some set $B_0$ with $P(B_0) = 0$.



### **`Theorem 10.2.2` Weak Law of Large Numbers (WLLN)**:

For all $\epsilon > 0$, $P(|\overline{X}_n - \mu| \geq \epsilon) \to 0$ as $n \to \infty$. (This is convergence in probability).

In other words, for any small positive number $\epsilon$, the probability that the sample mean $\overline{X}_n$ deviates from the true mean $\mu$ by at least $\epsilon$ approaches zero as the number of observations $n$ increases indefinitely.

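Under the finite-variance assumption made in this section, the WLLN follows in one line from Chebyshev's inequality applied to $\overline{X}_n$, using $E[\overline{X}_n] = \mu$ and $Var(\overline{X}_n) = \sigma^2 / n$. A sketch of that step:
$$
P(|\overline{X}_n - \mu| \geq \epsilon) \leq \frac{Var(\overline{X}_n)}{\epsilon^2} = \frac{\sigma^2}{n \epsilon^2} \xrightarrow[n \to \infty]{} 0
$$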
> * **Interpretation**: The LLN is essential for simulations, statistics and science in general. When generating data by replicating an experiment and averaging the result to approximate the theoretical average, we appeal to LLN.

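A short simulation shows the LLN at work. This is only a sketch: the fair die (true mean 3.5), the seed, and the checkpoint sizes are arbitrary choices for illustration, and a single simulated path demonstrates the behaviour rather than proving it.

```python
import random

def running_means(n_rolls, checkpoints, seed=0):
    """Sample mean of fair-die rolls, recorded at the given sample sizes."""
    rng = random.Random(seed)
    wanted = set(checkpoints)
    total = 0
    means = {}
    for n in range(1, n_rolls + 1):
        total += rng.randint(1, 6)     # one roll, uniform on {1, ..., 6}
        if n in wanted:
            means[n] = total / n       # sample mean after n rolls
    return means

means = running_means(100_000, checkpoints=[10, 1_000, 100_000])
for n, m in means.items():
    print(n, m, abs(m - 3.5))
```

With $\sigma^2 = 35/12$, the standard deviation of the sample mean at $n = 100{,}000$ is about $0.005$, so the last printed deviation should be of that order.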


## `5. Central limit theorem`

Let's assume:
1. We have a sequence of independent and identically distributed (i.i.d.) random variables $\{X_1, X_2, \ldots, X_n\}$;
2. Each has finite **mean $\mu$** and finite, non-zero **variance $\sigma^2$**;
3. $\overline{X}_n = \frac{X_1 + X_2 + \ldots + X_n}{n}$ is the sample mean of these random variables.

Then, LLN says that as $n \to \infty$, the sample mean $\overline{X}_n$ converges to the true mean $\mu$.

What about its distribution?

### **`Theorem 10.3.1` Central Limit Theorem (CLT)**:

As $n \to \infty$, the distribution of the standardized sample mean approaches a standard normal distribution $N(0, 1)$:

$$
Z_n = \sqrt{n} \left(\frac{\overline{X}_n - \mu}{\sigma}\right) \xrightarrow{d} N(0, 1)
$$

> * **Interpretation**: The CLT states that, regardless of the original distribution of the data (as long as its variance is finite), the distribution of the standardized sample mean tends to the normal distribution as the sample size increases. This is crucial for inferential statistics, as it justifies the use of normal distribution-based methods for hypothesis testing and confidence intervals, even when the underlying data is not normally distributed.

In other words, the CDF of the left-hand side (l.h.s.) converges to the CDF of the right-hand side (r.h.s.) at every $z$:
$$
\lim_{n \to \infty} P\left(Z_n \leq z\right) = \Phi(z)
$$
where $\Phi(z)$ is the CDF of the standard normal distribution.

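The CDF statement can be probed by simulation. The sketch below draws many standardized sample means $Z_n$ from Uniform(0, 1) data (so $\mu = 1/2$ and $\sigma^2 = 1/12$) and compares the empirical value of $P(Z_n \leq 1)$ with $\Phi(1)$. The choice of distribution, $n = 50$, the number of replications, and the seed are all assumptions made for illustration.

```python
import math
import random

def standard_normal_cdf(z):
    """Phi(z), written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def standardized_means(n, replications, seed=0):
    """Draw Z_n = sqrt(n) * (mean - mu) / sigma for Uniform(0, 1) samples."""
    rng = random.Random(seed)
    mu, sigma = 0.5, math.sqrt(1.0 / 12.0)
    zs = []
    for _ in range(replications):
        mean = sum(rng.random() for _ in range(n)) / n
        zs.append(math.sqrt(n) * (mean - mu) / sigma)
    return zs

zs = standardized_means(n=50, replications=20_000)
empirical = sum(z <= 1.0 for z in zs) / len(zs)   # empirical P(Z_n <= 1)
print(empirical, standard_normal_cdf(1.0))
```

The two printed numbers should agree to roughly two decimal places; the remaining gap is mostly Monte Carlo noise from the finite number of replications.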