# `3. Random Variables`

1. ~~Random variables and random elements: motivation and examples.~~
2. ~~Transformations of random variables.~~
3. ~~Distribution of a random variable.~~
4. ~~Characteristic function and moment-generating function.~~
5. ~~Moments and cumulants.~~
6. Families of distributions. 
7. Statistics. 
8. Entropy. 
9. Independent random variables, marginals.


### `1. Random variables and random elements: motivation and examples`

### Motivation: why do we even need random variables?
Dealing directly with a huge outcome space $\Omega$ is messy and exhausting. Also, we **don’t measure $\Omega$ directly** — the probability measure $P$ is defined on the **event space** $\mathcal F$, not on individual outcomes.


**`Example 01 (repeated fair die)`:**

If we roll a fair 6-sided die $n$ times, the outcomes are length-$n$ sequences, so:
$$
\Omega = \{\omega_1,\omega_2,\omega_3,\omega_4,\omega_5,\omega_6\}^n,
\qquad |\Omega|=6^n,
$$
and with the biggest possible $\sigma$-algebra:
$$
\mathcal F = 2^\Omega
$$
(which is the power set).

Then the number of measurable events explodes:
$$
|\mathcal F| = 2^{|\Omega|} = 2^{6^n}.
$$
For $n=2$ that is $2^{36}\approx 68$ billion events — not cool.

Even though for a concrete outcome (a concrete sequence) $\omega$ we have:
$$
P(\{\omega\}) = \left(\frac{1}{6}\right)^n = \frac{1}{|\Omega|},
$$
working with such gigantic structures becomes a nightmare for actual calculations.


### “Adjusting the scope”: focusing on what you need
Instead of tracking every tiny detail in $\Omega$, we often want **coarser questions**, like:
1) “Did we roll only even numbers?”  
2) “Did we never roll more than 4?”  
3) “Was the $(n-42)$-th roll a 4 or a 5?”

So we introduce a **helping tool**: a function that extracts only the information we care about.

**Even/odd detector example (the notes’ motivating construction).**  
Set $n=3$ and define a function $h$ that maps each roll in a sequence to Even/Odd, for example:
$$
h : (2,5,4) \mapsto (E,O,E).
$$
For any “detected pattern” $B\subset\{E,O\}^3$, we can look at the **preimage**:
$$
h^{-1}(B)=\{\text{all original sequences in }\Omega \text{ that produce pattern }B\}.
$$
Then we define the $\sigma$-algebra generated by this detector:
$$
\sigma(h)=\{h^{-1}(B)\mid B\subset\{E,O\}^3\}.
$$
Now we only care about 8 patterns (since $2^3=8$), and for “all even” we get:
$$
P\big(h^{-1}(\{(E,E,E)\})\big)=\left(\frac{3}{6}\right)^3=\frac{1}{8}.
$$

This is the key idea: **choose the right function / viewpoint → shrink the complexity.**
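As a small sanity check, the even/odd detector can be enumerated by brute force. This is only a sketch: it assumes the uniform measure on $\Omega$ for a fair die, and the names `omega`, `h`, and `preimages` are illustrative, not from the notes.

```python
from fractions import Fraction
from itertools import product

n = 3
omega = list(product(range(1, 7), repeat=n))   # all 6**n outcome sequences

def h(outcome):
    """Even/odd detector: map each roll to 'E' or 'O'."""
    return tuple("E" if roll % 2 == 0 else "O" for roll in outcome)

# Group outcomes by detected pattern; each group is a preimage h^{-1}({pattern}).
preimages = {}
for outcome in omega:
    preimages.setdefault(h(outcome), []).append(outcome)

# Under the uniform measure, P(h^{-1}(B)) = |h^{-1}(B)| / |Omega|.
p_all_even = Fraction(len(preimages[("E", "E", "E")]), len(omega))

print(len(omega))      # 216
print(len(preimages))  # 8
print(p_all_even)      # 1/8
```

The 216 outcomes collapse into 8 preimage classes, which is the complexity reduction the detector was introduced for.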


### Measurability via preimages (why the trick works)
The notes recall the “continuity via preimages” idea and reuse it for measurability: we call a function measurable when the preimage of a measurable set stays measurable.


### **`Def. 2.2`: Random Variable**
Given an outcome (sample) space $(\Omega,\mathcal F)$ and a measurable space $(Y,\mathcal Y)$, an $\mathcal F$-measurable function
$$
X:(\Omega,\mathcal F)\to (Y,\mathcal Y)
$$
is called a **random variable**.

Equivalently (what “$\mathcal F$-measurable” means here):
$$
\forall B\in\mathcal Y:\quad X^{-1}(B)\in\mathcal F.
$$
So events of the form “$X\in B$” are measurable events in the original probability space.
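On a finite space the preimage condition can be checked exhaustively. The sketch below assumes a single die roll with a deliberately coarse $\sigma$-algebra that only knows "even" versus "odd", and uses the power set of the image as the $\sigma$-algebra on the codomain. All names are hypothetical.

```python
def preimage(X, omega, B):
    """Return X^{-1}(B) as a frozenset of outcomes."""
    return frozenset(w for w in omega if X(w) in B)

def powerset(values):
    values = list(values)
    return [frozenset(v for i, v in enumerate(values) if mask >> i & 1)
            for mask in range(2 ** len(values))]

def is_measurable(X, omega, sigma_algebra):
    """Check that X^{-1}(B) lies in F for every subset B of the finite image of X."""
    image = {X(w) for w in omega}
    return all(preimage(X, omega, B) in sigma_algebra for B in powerset(image))

omega = frozenset(range(1, 7))
evens, odds = frozenset({2, 4, 6}), frozenset({1, 3, 5})
F_parity = {frozenset(), evens, odds, omega}   # coarse sigma-algebra on omega

def parity(w):
    return w % 2        # only needs to tell even from odd

def face(w):
    return w            # needs to distinguish every face

print(is_measurable(parity, omega, F_parity))  # True
print(is_measurable(face, omega, F_parity))    # False
```

The second function fails because, for example, the preimage of $\{1\}$ is $\{1\}$, which is not an event in the coarse $\sigma$-algebra.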

**Common shorthand used in the notes.**  
Often we restrict to $(\mathbb R,\mathcal B(\mathbb R))$ and write lazily:
$$
X:\Omega\to\mathbb R,
$$
while measurability is still the important hidden requirement.

> * **Note**: Random variables help us “organize” the outcome (sample) space by mapping it to some other space, such as the real line.

### **`Def. 2.2.X`: Random elements** (same definition, more general codomain)
A “random variable” is just a **random element** whose codomain is usually $\mathbb R$.
The notes emphasize that different codomains change what kind of data we model:

- $(\mathbb R,\mathcal B(\mathbb R))$: one measurement (a single number)
- $(\mathbb R^n,\mathcal B(\mathbb R^n))$: multiple measurements at once (a vector)
- $(\mathbb R^{m\times n},\mathcal B(\mathbb R^{m\times n}))$: tabular data (matrices)
- $(S^T,\mathcal C(T,S))$: function-valued objects with a cylinder $\sigma$-algebra (“borderline stochastic stuff”)

So: **random variable** = special case; **random element** = same idea for general $Y$.



### “Not so random variables”
The notes point out the classic joke: they are neither truly “random” nor “variables”.
They are **functions** that map outcomes (which can be anything) into something measurable (often numbers), so we can exploit the structure of $\mathbb R$ for calculations.


## `Examples`


### `Discrete Random Variables`:


#### `Example 1: Bernoulli RV`
$$
X:\Omega\to\{0,1\}
$$
with:
- $\Omega=\{\text{Success},\text{Fail}\}$
- $\mathcal F=\{\varnothing,\{\text{Success}\},\{\text{Fail}\},\Omega\}$
- $X(\text{Success})=1,\; X(\text{Fail})=0$

* **Interpretation**: any “hit or miss” / binary classification situation.


#### `Example 2: Binomial RV (count successes)`:

For $n=3$ trials:
$$
X:\Omega\to\{0,1,\dots,n\},\qquad \Omega=\{S,F\}^3,\quad \mathcal F=2^\Omega,
$$
and e.g.
$$
X(SFF)=1,\quad X(SSF)=2,\quad X(SFS)=2.
$$
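The counting variable can be written out directly. The sketch below assumes, purely for illustration, that all 8 outcomes are equally likely (success probability $1/2$); the notes do not fix a probability here.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

omega = ["".join(w) for w in product("SF", repeat=3)]   # 'SSS', 'SSF', ...

def X(outcome):
    """Count the successes in a sequence of trials."""
    return outcome.count("S")

# Assumption: uniform measure on omega, i.e. fair trials.
counts = Counter(X(w) for w in omega)
pmf = {k: Fraction(c, len(omega)) for k, c in sorted(counts.items())}

print(X("SFF"), X("SSF"), X("SFS"))   # 1 2 2
for k, p in pmf.items():
    print(k, p)
```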


#### `Example 3: Discrete rank RV`:

Ranks player $A$ in a 4-player match:
$$
X:\Omega\to\{1,2,3,4\},
$$
where $\Omega$ is all $4!$ permutations of $\{A,B,C,D\}$, and $\mathcal F=2^\Omega$.  
Examples:
$$
X(B,A,C,D)=2,\quad X(B,C,A,D)=3,\quad X(B,D,A,C)=3.
$$


### `Continuous Random Variables`:

#### `Example 1: Uniform RV`:
$$
X:\Omega\to[0,1],
\qquad \Omega=[0,1],\quad \mathcal F=\{\text{Borel subsets of }[0,1]\},\quad X(\omega)=\omega,
$$
with example values like $X(0.3)=0.3$, $X(0.734)=0.734$.


#### `Example 2: Exponential RV (waiting time)`:

$$
X:\Omega\to[0,\infty),
\qquad \mathcal F=\{\text{Borel subsets of }[0,\infty)\},
$$
example values like $X(\omega_1)=4.20$, $X(\omega_3)=304$.


#### `Example 3: Normal RV (deviation from mean)`:

$$
X:\Omega\to\mathbb R,
\qquad \mathcal F=\{\text{Borel subsets of }\mathbb R\},
$$
example values like $X(\omega_1)=-5.1$, $X(\omega_2)=0.1$, $X(\omega_3)=2.4$.


## `2. Transformations of random variables`

### **What is it / why do we care?**

* **What is it**: a transformation is when we take an existing random variable $X$ and build a new one by applying a (measurable) function $g$:
$$
Y = g(X).
$$
So we are not “creating new randomness” — we are **repackaging the same underlying randomness** into a new quantity.

* **Why we care** (typical reasons):
  1. **Derived quantities show up naturally** in applications: if $V$ is speed, energy is $E=\tfrac12 mV^2$; if $P_t$ is a price, return can be $\log(P_t/P_{t-1})$, etc.
  2. Often we know the distribution of $X$, but we actually need the distribution or expectation of $g(X)$.
  3. In simulation, we generate a “simple” $X$ (often uniform) and transform it to get more complex distributions.
  4. Many expectations can be computed **without** first finding the density of $Y$ (LOTUS), which saves a lot of work.

---

### **Definition: transformation of a random variable**

Given a probability space $(\Omega,\mathcal F,P)$, a random variable $X:\Omega\to\mathbb R$, and a measurable function $g:\mathbb R\to\mathbb R$, define
$$
Y(\omega) = g(X(\omega)), \qquad \omega\in\Omega.
$$
Then $Y$ is also a random variable on the same probability space.

---

## `Core idea`: probability moves through the function (pushforward)

A distribution tells you “how probability mass sits on the line”.  
Applying $g$ **moves that mass**.

For any measurable set $B\subseteq\mathbb R$:
$$
P_Y(B)=P(Y\in B)
= P(g(X)\in B)
= P\big(X\in g^{-1}(B)\big)
= P_X\big(g^{-1}(B)\big).
$$

**Interpretation**: to know how likely $Y$ is to land in $B$, look at the set of $x$ values that get mapped into $B$ and measure how likely $X$ is to fall there.

---

## `LOTUS` (Law of the Unconscious Statistician)

This is the main shortcut for expectations.

Let $Y=g(X)$. For any measurable function $h$ such that $h(Y)$ is integrable:
$$
\mathbb E[h(Y)] = \mathbb E[h(g(X))].
$$

More explicitly, in terms of the distribution of $X$:
$$
\mathbb E[h(Y)] = \int_{\mathbb R} h(g(x))\, dP_X(x).
$$

If $X$ has a density $f_X$, then:
$$
\mathbb E[h(Y)] = \int_{\mathbb R} h(g(x))\, f_X(x)\, dx.
$$

**Why this matters**: many problems ask for $\mathbb E[g(X)]$ (means, variances, risks, losses). LOTUS lets you compute it directly from $X$.


---


## How to compute the distribution of $Y=g(X)$ in practice

There are three common routes.


---


### `1) CDF method` (works very generally)

Start from:
$$
F_Y(y) = P(Y\le y) = P(g(X)\le y).
$$

* If $g$ is **strictly increasing**, then $g(X)\le y \iff X\le g^{-1}(y)$, so:
$$
F_Y(y) = F_X(g^{-1}(y)).
$$

* If $g$ is **strictly decreasing**, then $g(X)\le y \iff X\ge g^{-1}(y)$, so:
$$
F_Y(y) = P(X\ge g^{-1}(y)) = 1 - F_X(g^{-1}(y))
$$
(with the usual “$\le$ vs $<$” convention depending on continuity).

**When to use**: when you know $F_X$ or when $X$ might not have a density.
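A quick simulation can illustrate the decreasing case. As an assumed example (not from the notes), take $X\sim\mathrm{Unif}(0,1]$ and $g(x)=-\ln x$, so $g^{-1}(y)=e^{-y}$ and the formula predicts $F_Y(y)=1-F_X(e^{-y})=1-e^{-y}$ for $y\ge 0$.

```python
import math
import random

random.seed(0)
N = 100_000
# 1 - random.random() lies in (0, 1], so the logarithm is always defined.
samples = [-math.log(1.0 - random.random()) for _ in range(N)]

def empirical_cdf(data, y):
    """Fraction of samples that are <= y."""
    return sum(v <= y for v in data) / len(data)

def F_Y(y):
    """CDF predicted by the decreasing-case formula 1 - F_X(g^{-1}(y))."""
    return 1.0 - math.exp(-y) if y >= 0 else 0.0

for y in (0.5, 1.0, 2.0):
    print(y, round(empirical_cdf(samples, y), 3), round(F_Y(y), 3))
```

The empirical and predicted columns should agree up to sampling noise; the result is the CDF of an exponential distribution with rate 1.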


---


### `2) Density change-of-variables` (fast when $g$ is one-to-one)

Assume:
- $X$ has density $f_X$,
- $g$ is differentiable and strictly monotone (so $g^{-1}$ exists).

Then $Y=g(X)$ has density:
$$
f_Y(y) = f_X(g^{-1}(y))\left|\frac{d}{dy}g^{-1}(y)\right|.
$$

**Intuition**: small probability masses must match:
$$
P(y\le Y\le y+dy) \approx P(x\le X\le x+dx),
$$
and the mapping rescales lengths by:
$$
dx = \left|\frac{d}{dy}g^{-1}(y)\right|dy.
$$
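The formula can be checked numerically on a concrete case. This sketch assumes $X\sim\mathrm{Exp}(1)$ and $g(x)=\sqrt{x}$, so $g^{-1}(y)=y^2$ and $f_Y(y)=2y\,e^{-y^2}$ for $y\ge 0$; the helper `midpoint_integral` is a hypothetical utility, not a library call.

```python
import math

def f_X(x):
    """Density of Exp(1)."""
    return math.exp(-x) if x >= 0 else 0.0

def g_inv(y):
    return y * y            # inverse of g(x) = sqrt(x)

def dg_inv(y):
    return 2.0 * y          # derivative of g_inv

def f_Y(y):
    """Change of variables: f_X(g^{-1}(y)) * |d/dy g^{-1}(y)|."""
    if y < 0:
        return 0.0
    return f_X(g_inv(y)) * abs(dg_inv(y))

def midpoint_integral(f, a, b, steps=100_000):
    width = (b - a) / steps
    return width * sum(f(a + (i + 0.5) * width) for i in range(steps))

total_mass = midpoint_integral(f_Y, 0.0, 10.0)   # should be close to 1
p_le_1 = midpoint_integral(f_Y, 0.0, 1.0)        # should match P(X <= 1)

print(total_mass, p_le_1, 1.0 - math.exp(-1.0))
```

Since $Y\le 1 \iff X\le 1$ here, the second integral should agree with $F_X(1)=1-e^{-1}$.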

---

### `3) Non-injective (many-to-one) transformations` (folding the line)

If $g$ is not one-to-one (e.g., $g(x)=x^2$), a single $y$ can have multiple preimages $x_i$ with $g(x_i)=y$.

Under standard regularity assumptions (piecewise monotone, differentiable):
$$
f_Y(y) = \sum_{x_i:\, g(x_i)=y}\frac{f_X(x_i)}{|g'(x_i)|}.
$$

**Intuition**: if the function “folds” the real line, probability mass from multiple branches lands on the same $y$, so densities add up.
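For the folding example $g(x)=x^2$ with a standard normal $X$ (an assumed choice of distribution), the two preimages are $\pm\sqrt{y}$ and the sum should reproduce the chi-square density with one degree of freedom, $e^{-y/2}/\sqrt{2\pi y}$. A minimal check:

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def f_Y(y):
    """Density of Y = X**2 as a sum over the preimages +sqrt(y) and -sqrt(y)."""
    if y <= 0:
        return 0.0
    roots = (math.sqrt(y), -math.sqrt(y))
    return sum(phi(x) / abs(2.0 * x) for x in roots)   # g'(x) = 2x

def chi2_1(y):
    """Closed-form chi-square density with one degree of freedom."""
    return math.exp(-0.5 * y) / math.sqrt(2.0 * math.pi * y)

for y in (0.1, 1.0, 2.5):
    print(y, f_Y(y), chi2_1(y))
```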

---


## Discrete case (pmf transformation)

If $X$ is discrete with pmf $p_X(x)=P(X=x)$, then $Y=g(X)$ is discrete and:
$$
p_Y(y)=P(Y=y)=\sum_{x:\, g(x)=y} P(X=x)=\sum_{x:\, g(x)=y} p_X(x).
$$

**Intuition**: collect all probability from values of $X$ that map to the same $y$.
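The "collect the mass" rule is a few lines of code. The sketch below assumes a fair die for $X$ and the hypothetical map $g(x)=(x-3)^2$, which sends two faces to the same value in two cases.

```python
from collections import defaultdict
from fractions import Fraction

p_X = {x: Fraction(1, 6) for x in range(1, 7)}   # fair die

def pushforward_pmf(p_X, g):
    """p_Y(y) = sum of p_X(x) over all x with g(x) = y."""
    p_Y = defaultdict(Fraction)
    for x, p in p_X.items():
        p_Y[g(x)] += p
    return dict(p_Y)

p_Y = pushforward_pmf(p_X, lambda x: (x - 3) ** 2)

for y in sorted(p_Y):
    print(y, p_Y[y])
```

Faces 2 and 4 both map to 1, and faces 1 and 5 both map to 4, so those values receive mass $1/3$ each, while 0 and 9 receive $1/6$.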


---

## Intuition summary: “mass gets reshaped”

Think of the distribution of $X$ as “mass on the $x$-axis”.

* If $g$ **stretches** distances, density **decreases**.
* If $g$ **compresses** distances, density **increases**.
* If $g$ **folds** the axis (not one-to-one), densities from different branches **add**.

That is exactly what the Jacobian factor and the “sum over preimages” formula express.

---

## `Example A`: expectation via LOTUS (no need for $f_Y$)

Let $X\sim \mathrm{Unif}[0,1]$, $Y=2X+1$, and let $h(y)=y^2$.
Then:
$$
\mathbb E[h(Y)] = \mathbb E[h(2X+1)]
= \int_0^1 (2x+1)^2\,dx.
$$
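Expanding the integrand gives $\int_0^1 (4x^2+4x+1)\,dx = \tfrac43+2+1=\tfrac{13}{3}$. The sketch below compares that value with a midpoint-rule evaluation of the LOTUS integral and with a plain Monte Carlo average over simulated $X$; the step count, seed, and sample size are arbitrary choices.

```python
import random

def g(x):
    return 2.0 * x + 1.0

def h(y):
    return y * y

# LOTUS: integrate h(g(x)) against the Unif[0,1] density (which is 1 on [0,1]).
steps = 1000
lotus = sum(h(g((i + 0.5) / steps)) for i in range(steps)) / steps

# Monte Carlo: average h(g(X)) over simulated draws of X.
random.seed(1)
N = 200_000
monte_carlo = sum(h(g(random.random())) for _ in range(N)) / N

print(lotus, monte_carlo, 13 / 3)
```

Neither computation needs the density of $Y$, which is the point of the shortcut.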

---

## `Example B`: simple discrete transformation

Let $X\sim \mathrm{Bernoulli}(p)$ and define $Y=1-X$.
Then:
$$
P(Y=1)=P(X=0)=1-p,\qquad P(Y=0)=P(X=1)=p,
$$
so:
$$
Y\sim \mathrm{Bernoulli}(1-p),
\qquad
\mathbb E[Y]=1-p.
$$

---

## What to say in an oral exam (short + correct)

> “A transformation is $Y=g(X)$. The distribution of $Y$ is the pushforward of the distribution of $X$: for measurable $B$,
> $$
> P(Y\in B)=P(X\in g^{-1}(B)).
> $$
> Practically, I can compute $F_Y(y)=P(g(X)\le y)$ via CDF manipulation, or if $X$ has a density and $g$ is monotone I use the Jacobian:
> $$
> f_Y(y)=f_X(g^{-1}(y))\left|\frac{d}{dy}g^{-1}(y)\right|.
> $$
> If $g$ is not one-to-one, I sum over all preimages. For expectations I use LOTUS:
> $$
> \mathbb E[h(Y)] = \int h(g(x))\,dP_X(x).
> $$

## `3. Distribution of a random variable`

### Motivation: we want probabilities of events like `X ∈ B`
Given a probability space $(\Omega,\mathcal F,P)$ and a random variable
$$
X:\Omega \to \mathbb R,
$$
we often care about events of the form:
$$
\{\omega\in\Omega: X(\omega)\in B\} = X^{-1}(B),
$$
where $B$ is a (Borel) set in $\mathbb R$.
This event is measurable (belongs to $\mathcal F$), so it has a probability.

> * **`Def.`: Borel Set**: A Borel set is any set in the Borel $\sigma$-algebra $\mathcal B(\mathbb R)$, which is generated by all open intervals in $\mathbb R$ through countable unions, intersections, and complements.



### **`Def. 2.3`: Distribution (law) of a Random Variable**
Given a probability space $(\Omega,\mathcal F,P)$, a measurable space $(\mathbb Y,\mathcal Y)$, and a random variable $X:\Omega\to\mathbb Y$, its distribution (law) is the **pushforward probability measure** $P_X$ on $(\mathbb Y,\mathcal Y)$.

**Meaning**:

$$
\forall B\in\mathcal Y:\quad P_X(B) = P(X\in B)=P(X^{-1}(B)).
$$


> * **Important note**: having the same distribution does **not** mean two random variables are equal as functions.


## `Distributions and Probability Mass Functions (PMFs)`

* There are 2 main types of distributions: ***discrete*** and ***continuous***.


## `3.1 Discrete distributions`

#### **`Def. 3.2.1`: Discrete random variable**

A random variable $X$ is said to be **discrete** if it takes values in a finite or countably infinite set
$$
\{a_1,a_2,\dots\}
\quad\text{such that}\quad
P(X=a_j\text{ for some }j)=1.
$$

If $X$ is a discrete random variable, then the finite or countably infinite set of values $x$ with $P(X = x) > 0$ is called the **support** of $X$.


> * **Note**: ***Continuous Random Variables*** can take any value in an interval.


---

* The *distribution* of a **Random Variable** specifies the probabilities of all events associated with it. For a discrete random variable, this is captured by the **Probability Mass Function (PMF)**.

---


#### **`Def. 3.2.2`: Probability Mass Function**

The **probability mass function (PMF)** of a discrete random variable $X$ is the function $p_X$ given by:
$$
p_X(x)=P(X=x).
$$
It is $>0$ on the support of $X$ and $0$ otherwise.

>* **Notes**: In writing $P(X=x)$, we mean the probability of the **event** $\{\omega\in\Omega: X(\omega)=x\}$.

---

#### **`Thm. 3.2.7`: Valid PMFs**
Let $X$ be a discrete random variable with support $\{x_1,x_2,\dots\}$. The *Probability Mass Function* (PMF) $p_X$ of $X$ must satisfy:

1. **Non-negativity**: 
$$p_X(x) > 0 \text{ if } x = x_j \text{ for some } j, \quad\text{and}\quad p_X(x) = 0 \text{ otherwise.}$$

2. **Normalization**: 
$$
\sum_{j=1}^\infty p_X(x_j)=1
$$

---


## `3.2 Cumulative Distribution Functions (CDF)` 

* Works for all random variables.

---

#### **`Def. 3.6.1`: Cumulative distribution function (CDF)**

The **Cumulative distribution function** (CDF) of a random variable $X$ is the function $F_X$ given by:
$$
F_X(x)=P(X\le x).
$$

>* **Note**: Only *discrete random variables* have **Probability Mass Functions** (PMFs), but **all random variables have CDFs**.

---

#### **`Thm. 3.6.3`: Valid CDFs**
Any Cumulative Distribution Function (CDF) $F$ has these properties:

1) **Non-decreasing**:
$$
\text{If } x_1\le x_2 \text{ then } F(x_1)\le F(x_2)
$$

2) **Right-continuous**: The CDF may have jumps, but at every point, including each jump, it is continuous from the **right**. For all $a\in\mathbb R$:
$$
F(a)=\lim_{x\to a^+}F(x)
$$

3) **Convergence to $0$ and $1$**:
$$
\lim_{x\to-\infty}F(x)=0 \text{   and   } \lim_{x\to\infty}F(x)=1
$$

4) **Normalization** (the range is bounded):
$$
\forall x\in\mathbb R:\quad 0\le F(x)\le 1
$$

---

## `3.3 Relationship between PMFs and CDFs`:

For **discrete random variables**, we can easily convert between PMFs and CDFs:

1. **From PMF to CDF**:
- To find $P(X\le x)$ for a real number $x$, sum the PMF values over all support points $x_j$ with $x_j\le x$:

$$
F_X(x) = P(X\le x) = \sum_{x_j \le x} p_X(x_j)
$$

2. **From CDF to PMF**:
- The CDF of a discrete random variable consists of jumps and flat regions. The ***height of the jump at $x_j$*** equals the ***value of the PMF at $x_j$***:
$$
p_X(x_j) = F_X(x_j) - \lim_{x\to x_j^-} F_X(x)
$$
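Both conversions can be sketched for a fair die (an assumed example). For a step-function CDF, the left limit at a support point equals the CDF value at the previous support point, which is what `pmf_from_cdf` relies on.

```python
from fractions import Fraction

support = [1, 2, 3, 4, 5, 6]                      # sorted support points
pmf = {x: Fraction(1, 6) for x in support}

def cdf(x):
    """PMF -> CDF: sum the PMF over support points <= x."""
    return sum((p for xj, p in pmf.items() if xj <= x), Fraction(0))

def pmf_from_cdf(xj):
    """CDF -> PMF: jump height at xj, i.e. F(xj) minus the left limit at xj."""
    previous = [s for s in support if s < xj]
    left_limit = cdf(previous[-1]) if previous else Fraction(0)
    return cdf(xj) - left_limit

print(cdf(3.5))          # 1/2
print(pmf_from_cdf(4))   # 1/6
```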





## `4. Characteristic function (CF) and moment-generating function (MGF)`

### **What is it / why do we care?**

**Big idea:** Sometimes the CDF/PDF is the “wrong coordinate system” for a problem.  
CF and MGF are **transforms** of the distribution that:
- compress the whole distribution into a single function,
- turn *hard operations* (like convolution / sums of independent RVs) into *easy algebra* (multiplication / addition),
- encode moments (when they exist) through derivatives/Taylor expansions,
- are powerful for limit theorems (CLT, convergence in distribution), because “convergence of CFs” is a standard tool. 

---

## `4.1 Characteristic function`

### **Definition**
Let $X$ be a real-valued random variable. Its **characteristic function** is
$$
\varphi_X(t) \;=\; \mathbb E\!\left[e^{itX}\right]
\;=\; \int_{\mathbb R} e^{itx}\, dP_X(x),
\qquad t\in\mathbb R.
$$ 

**How to interpret it (intuition):**
- The kernel $e^{itx}$ lies on the unit circle in $\mathbb C$.
- So $\varphi_X(t)$ is like a **weighted “average rotation”** of unit vectors $e^{itx}$ under the distribution of $X$.
- This is a **Fourier transform of the probability measure** $P_X$, i.e. you’re viewing the distribution in a “frequency domain”.

### **Key properties (what they *mean*)**
1. **Always exists:** for every $X$ and every $t\in\mathbb R$, $\varphi_X(t)$ is defined (because $|e^{itX}|=1$). 
2. **Bounded:** 
$$
|\varphi_X(t)|\le 1.
$$
(You’re averaging points on the unit circle, so you can’t get magnitude bigger than 1.) 
3. **Normalization:** 
$$
\varphi_X(0)=1.
$$
(At $t=0$, the kernel is $1$ everywhere, so the average is $1$.)  
4. **Uniqueness / injectivity:** $\varphi_X$ determines the distribution $P_X$ (you can recover $P_X$ via an inverse Fourier-type result).
5. **Continuity:** $\varphi_X$ is (uniformly) continuous in $t$.

### **The main “why it’s useful”: sums become products**
If $X$ and $Y$ are independent, then
$$
\varphi_{X+Y}(t) = \varphi_X(t)\,\varphi_Y(t).
$$
So:
- adding independent variables (a convolution in the PDF world) becomes **multiplication** in CF world,
- and for i.i.d. sums $S_n=\sum_{k=1}^n X_k$:
$$
\varphi_{S_n}(t) = \big(\varphi_X(t)\big)^n.
$$
This is one of the reasons CFs are central in proofs of CLT and other limit results.  
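The product rule is easy to verify numerically for a discrete example. The sketch below assumes two independent fair dice, builds the PMF of the sum by explicit convolution, and compares its characteristic function with the square of the single-die one. The helper names are illustrative.

```python
import cmath
from collections import defaultdict

def cf(pmf, t):
    """Characteristic function of a discrete distribution: E[exp(i t X)]."""
    return sum(p * cmath.exp(1j * t * x) for x, p in pmf.items())

def convolve(p, q):
    """PMF of X + Y for independent X ~ p and Y ~ q."""
    out = defaultdict(float)
    for x, px in p.items():
        for y, qy in q.items():
            out[x + y] += px * qy
    return dict(out)

die = {x: 1 / 6 for x in range(1, 7)}
two_dice = convolve(die, die)

for t in (0.0, 0.7, 2.0):
    lhs = cf(two_dice, t)          # CF of the sum
    rhs = cf(die, t) ** 2          # product of the individual CFs
    print(t, abs(lhs - rhs) < 1e-12, abs(cf(die, t)) <= 1.0 + 1e-12)
```

The same loop also confirms the bound $|\varphi_X(t)|\le 1$ at the sampled points.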

### **Moments from derivatives (when moments exist)**
If $\mathbb E[|X|^n]<\infty$, then derivatives at $0$ encode moments:
$$
\varphi_X^{(n)}(0) = i^n\,\mathbb E[X^n].
$$ 

Equivalently (formal Taylor expansion near $0$):
$$
\varphi_X(t)=\sum_{n=0}^{\infty}\frac{(it)^n}{n!}\,\mathbb E[X^n]
\quad\text{(when the expansion is justified).}
$$ 

---

## `4.2 Moment-generating function (MGF)`

### **Definition**
The **moment-generating function** is
$$
M_X(t) \;=\; \mathbb E[e^{tX}]
\;=\; \int_{\mathbb R} e^{tx}\, dP_X(x),
\qquad t\in\mathbb R \text{ (where finite).}
$$

**Key intuition:**
- Compare kernels: CF uses $e^{itX}$ (bounded), MGF uses $e^{tX}$ (can explode).
- So MGF “sees” **tail growth** very strongly: large positive values of $X$ get exponentially amplified when $t>0$. 

### **Important warning: MGF may not exist**
Because $e^{tX}$ is unbounded, $M_X(t)$ can be infinite or undefined for some (or all) $t\ne 0$.
Usually we require: **there exists an open interval around $0$ where $M_X(t)<\infty$**.

### **Moments from derivatives**
If $M_X(t)$ is finite in a neighborhood of $0$, then:
$$
M_X^{(n)}(0) = \mathbb E[X^n].
$$

So MGF is literally a “moment machine”: the Taylor coefficients at $0$ are raw moments:
$$
M_X(t)=\sum_{n=0}^{\infty}\frac{t^n}{n!}\,\mathbb E[X^n]
\quad\text{(when analytic near 0).}
$$
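As an illustration, the MGF of a $\mathrm{Bernoulli}(p)$ variable is $M_X(t)=(1-p)+p\,e^t$, so both $M_X'(0)$ and $M_X''(0)$ should equal $p$ (because $X^2=X$). The sketch approximates the derivatives with central finite differences; `p = 0.3` and the step `h` are arbitrary choices.

```python
import math

p = 0.3   # hypothetical success probability

def M(t):
    """MGF of Bernoulli(p): E[e^{tX}] = (1 - p) + p * e^t."""
    return (1.0 - p) + p * math.exp(t)

h = 1e-4
first = (M(h) - M(-h)) / (2 * h)                 # approximates M'(0)  = E[X]
second = (M(h) - 2 * M(0.0) + M(-h)) / h ** 2    # approximates M''(0) = E[X^2]

print(first, second, p)
```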

### **Sums of independent RVs**
If $X$ and $Y$ are independent and MGFs exist where needed:
$$
M_{X+Y}(t)=M_X(t)\,M_Y(t).
$$
Same “convolution becomes multiplication” advantage as CF.

### **CF vs MGF in one line**
- CF: always exists, great for distribution/limits, moments via $i^n$ derivatives.
- MGF: may fail to exist, but when it exists it’s very convenient for moments and exponential tail behavior.

---

# `5. Moments and cumulants`

## `5.1 Moments`

### **What is it / why do we care?**
**Moments** summarize a distribution numerically:
- mean = “center of mass” (balance point),
- variance = “spread around the mean” (squared-distance style),
- skewness = “asymmetry / tilt”,
- kurtosis = “tail heaviness / peakiness”.

They’re used to:
- compare distributions,
- approximate distributions (moment matching),
- detect tail risk (fat tails),
- build estimators/statistics.

### **Raw moments**
The $n$-th **raw moment** is
$$
m_n := \mathbb E[X^n].
$$

### **Central moments**
Let $\mu := \mathbb E[X]$. The $n$-th **central moment** is
$$
\mu_n := \mathbb E[(X-\mu)^n].
$$

A key relation: central moments can be written from raw moments by binomial expansion:
$$
\mu_n
= \mathbb E[(X-\mu)^n]
= \sum_{k=0}^n {n\choose k}(-\mu)^{\,n-k}\,m_k.
$$ 

### **Most used ones**
1. **Mean**
$$
\mu = \mathbb E[X] = m_1.
$$

2. **Variance**
$$
\mathrm{Var}(X)=\mu_2=\mathbb E[(X-\mu)^2]=\mathbb E[X^2]-(\mathbb E[X])^2 = m_2 - m_1^2.
$$

3. **Skewness (dimensionless)**
Let $\sigma=\sqrt{\mu_2}$. Then
$$
\gamma_1 := \frac{\mu_3}{\sigma^3}.
$$

4. **Kurtosis (dimensionless)**
$$
\gamma_2 := \frac{\mu_4}{\sigma^4}.
$$ 
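The definitions above can be evaluated exactly for a small discrete distribution. The PMF below is a made-up skewed example; the sketch also checks the binomial-expansion relation between central and raw moments.

```python
from fractions import Fraction
from math import comb

# Hypothetical skewed PMF on {0, 1, 4}.
pmf = {0: Fraction(1, 2), 1: Fraction(1, 3), 4: Fraction(1, 6)}

def raw_moment(n):
    return sum(p * x ** n for x, p in pmf.items())

def central_moment(n):
    mu = raw_moment(1)
    return sum(p * (x - mu) ** n for x, p in pmf.items())

def central_from_raw(n):
    """Binomial expansion: sum_k C(n,k) * (-mu)^(n-k) * m_k."""
    mu = raw_moment(1)
    return sum(comb(n, k) * (-mu) ** (n - k) * raw_moment(k) for k in range(n + 1))

mean = raw_moment(1)
variance = central_moment(2)
skewness = float(central_moment(3)) / float(variance) ** 1.5
kurtosis = float(central_moment(4) / variance ** 2)

print(mean, variance, skewness, kurtosis)
```

For this PMF the mean is 1, the variance is 2, and the kurtosis is $14/4 = 3.5$; the positive skewness reflects the long right tail created by the value 4.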

### **Caution: moments may not exist**
Some distributions look “nice” but have divergent moments (classic example: Cauchy).
So “I can compute the mean/variance” is not automatic — it requires integrability assumptions.


---

## `5.2 Cumulants`

### **What is it / why do we care?**
Cumulants are another set of numerical summaries, but with a *huge structural advantage*:

> **Cumulants add under independent sums.**

That makes them extremely useful for:
- analyzing sums of independent RVs,
- approximation/expansions (Edgeworth / saddlepoint ideas),
- separating “shape information” cleanly (mean/variance/skewness/kurtosis appear naturally).  

Moments mix together under sums; cumulants behave cleanly.

---

### **Cumulant generating function (CGF)**
Assume the MGF exists near $0$. Define the **cumulant generating function**
$$
K_X(t) := \log M_X(t).
$$

The **$n$-th cumulant** is
$$
\kappa_n := K_X^{(n)}(0).
$$

So cumulants are “derivatives of $\log$ MGF at 0”.

---

### **First cumulants (interpretation)**
When they exist:
- $\kappa_1 = \mathbb E[X]$ (mean),
- $\kappa_2 = \mathrm{Var}(X)$ (variance),
- $\kappa_3 = \mu_3$ (third central moment),
- $\kappa_4 = \mu_4 - 3\mu_2^2$ (related to “excess kurtosis structure”).

(So cumulants are tightly linked to central moments; higher-order ones remove the “redundant combinations” that moments contain.)

---

### **The killer property: additivity**
If $X$ and $Y$ are independent and MGFs exist near $0$, then:
$$
M_{X+Y}(t)=M_X(t)M_Y(t)
\quad\Rightarrow\quad
K_{X+Y}(t)=K_X(t)+K_Y(t).
$$
Taking derivatives at $0$ gives:
$$
\kappa_n(X+Y)=\kappa_n(X)+\kappa_n(Y).
$$
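Additivity can be checked exactly on finite distributions, where all moments exist. The sketch below computes the first four cumulants from the central-moment relations listed earlier ($\kappa_3=\mu_3$, $\kappa_4=\mu_4-3\mu_2^2$) for two assumed independent variables and for their sum, and shows that the fourth central moment itself is not additive.

```python
from collections import defaultdict
from fractions import Fraction

def central(pmf, n):
    mu = sum(p * x for x, p in pmf.items())
    return sum(p * (x - mu) ** n for x, p in pmf.items())

def cumulants(pmf):
    """First four cumulants: mean, mu_2, mu_3, mu_4 - 3 * mu_2**2."""
    mean = sum(p * x for x, p in pmf.items())
    mu2, mu3, mu4 = (central(pmf, n) for n in (2, 3, 4))
    return (mean, mu2, mu3, mu4 - 3 * mu2 ** 2)

def convolve(p, q):
    """PMF of X + Y for independent X ~ p and Y ~ q."""
    out = defaultdict(Fraction)
    for x, px in p.items():
        for y, qy in q.items():
            out[x + y] += px * qy
    return dict(out)

X = {0: Fraction(1, 2), 1: Fraction(1, 3), 4: Fraction(1, 6)}   # hypothetical PMF
Y = {x: Fraction(1, 6) for x in range(1, 7)}                     # fair die
S = convolve(X, Y)

added = tuple(a + b for a, b in zip(cumulants(X), cumulants(Y)))
print(cumulants(S) == added)                           # True
print(central(S, 4) == central(X, 4) + central(Y, 4))  # False
```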

This is why cumulants are often the right tool whenever you see “sum of independent random variables”.

---

## What to say in an oral exam (short + correct)

**CF:**
> “The characteristic function is $\varphi_X(t)=\mathbb E[e^{itX}]$, the Fourier transform of $P_X$. It always exists, determines the distribution, and turns sums of independent RVs into products: $\varphi_{X+Y}=\varphi_X\varphi_Y$. Derivatives at $0$ encode moments when they exist.”

**MGF:**
> “The MGF is $M_X(t)=\mathbb E[e^{tX}]$, a Laplace-type transform. It may fail to exist, but if it exists near $0$ then $M_X^{(n)}(0)=\mathbb E[X^n]$ and $M_{X+Y}=M_XM_Y$ for independent sums.”

**Moments vs cumulants:**
> “Moments summarize shape (mean/variance/skewness/kurtosis) but can mix under sums; cumulants come from $K_X(t)=\log M_X(t)$ and are additive for independent sums, which makes them especially useful for analyzing sums and approximations.”




# `Second Lector`:

---

### **`Def. 2.2` Random Variable**:

Given a sample space $(\Omega,\mathcal{F})$ and a measurable space $(Y,\mathcal{Y})$, an $\mathcal{F}$-measurable function $X:(\Omega,\mathcal{F})\to(Y,\mathcal{Y})$ is called a **random variable**.

---

### **`Def. 2.3` Distribution of a Random Variable**:

Given a probability space $(\Omega,\mathcal{F},P)$, a measurable space $(\mathbb{Y},\mathcal{Y})$, and a random variable $X$, its distribution (law) is defined as a **pushforward probability measure** $P_X$ on $(\mathbb{Y},\mathcal{Y})$ by:
$$\forall B\in\mathcal{Y}:\quad P_X(B) = P(X\in B) = P(X^{-1}(B)).$$

---

### **`Def. 2.4` Cumulative Distribution Function (CDF)**

Given a probability space $(\Omega,\mathcal{F},P)$ and a real-valued random variable $X$ with distribution $P_X$, the **Cumulative Distribution Function** $F_X :\mathbb{R}\to[0,1]$ is defined for every $x\in\mathbb{R}$ as:

$$
F_X(x) = P(X^{-1}((-\infty,x])) = P_X((-\infty,x]) 
$$

It satisfies:

1. **Non-decreasing**: 
$$
\text{if } x_1\le x_2 \text{ then } F_X(x_1)\le F_X(x_2)
$$

2. **Right-continuous**:
$$
F_X(a) = \lim_{x\to a^+} F_X(x)
$$

3. **Limits at infinity**:
$$
\lim_{x\to -\infty} F_X(x) = 0 \quad\text{and}\quad \lim_{x\to +\infty} F_X(x) = 1
$$

4. **Normalization**:
$$
0 \le F_X(x) \le 1 \text{ for all } x\in\mathbb{R}
$$

---

### **`Def. 2.5` Probability Mass Function (PMF)**:

Given a probability space $(\Omega,\mathcal{F},P)$ and a discrete random variable $X$ taking values in a countable set $S=\{x_1,x_2,\dots\}$, the **Probability Mass Function (PMF)** of $X$ is the function $p_X:\mathbb{R}\to[0,1]$ defined by:

$$
p_X(x) = P(X=x) \text{ for all } x\in S
$$

With the properties:

1. **Non-negativity**: 
$$
p_X(x) \ge 0 \text{ for all } x\in S
$$

2. **Normalization**:
$$
\sum_{x\in S} p_X(x) = 1
$$

3. **Zero outside support**:
$$
p_X(x) = 0 \text{ if } x\notin S
$$

> **Note**: Recovering the CDF: $$F_X(x) = \sum_{x_j\le x} p_X(x_j)$$

---

### **`Def. 2.6` Probability Density Function (PDF)**:

Given a probability space $(\Omega,\mathcal{F},P)$ and a continuous real-valued random variable $X$ whose distribution is absolutely continuous with respect to the Lebesgue measure, $P_X \ll \lambda$, and whose CDF is $F_X$, a **Probability Density Function (PDF)** is a function $f_X:\mathbb{R}\to[0,\infty)$ such that:

1. 
$$
f_X(x) \ge 0 \text{ for all } x\in\mathbb{R}
$$

2. 
$$
\int_{-\infty}^{\infty} f_X(x)\, dx = 1
$$


3.
$$
P(X\in B) = \int_B f_X(x)\, dx \quad\text{for every Borel set } B\subseteq\mathbb{R}
$$



# `Second Lector`:

---

### **`Def. 2.7` Expected Value ($E$ / Mean $\mu$ )**:

Given a probability space $(\Omega,\mathcal{F},P)$ and a random variable $X$, the **Expected Value** (or mean) is the weighted average of all possible values of $X$. It is defined as:

1. **Discrete Case**: 
$$E[X] = \sum_{i} x_i P(X = x_i)$$
2. **Continuous Case**: 
$$E[X] = \int_{-\infty}^{\infty} x f_X(x) dx$$

**Properties**:
* **Linearity**: $E[aX + bY] = aE[X] + bE[Y]$ for constants $a, b$.
* **Expectation of a function ($X \to g(X)$)**: $E[g(X)] = \int g(x) f_X(x) dx$.

Where $f_X(x)$ is the probability density function of $X$.

---

### **`Def. 2.8` Variance ($\sigma^2$)**:

The **Variance** of a random variable $X$ measures the spread of the distribution around its mean $\mu = E[X]$. It is defined as:

$$Var(X) = E[(X - E[X])^2]$$

**Properties**:
1. **Alternative Formula**: $Var(X) = E[X^2] - (E[X])^2$
2. **Non-negativity**: $Var(X) \ge 0$
3. **Scaling**: $Var(aX + b) = a^2 Var(X)$

---

### **`Def. 2.9` Covariance ($Cov$)**:

Given two random variables $X$ and $Y$ defined on the same probability space, the **Covariance** measures their joint linear variability:

$$Cov(X, Y) = E[(X - E[X])(Y - E[Y])]$$

**Properties**:
1. **Alternative Formula**: $Cov(X, Y) = E[XY] - E[X]E[Y]$
2. **Symmetry**: $Cov(X, Y) = Cov(Y, X)$
3. **Relation to Variance**: $Cov(X, X) = Var(X)$
4. **Variance of a Sum**: $Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)$

---

### **`Def. 2.10` Independence of Random Variables**:

Two random variables $X$ and $Y$ are **independent** if for all measurable sets $A, B$, the events $\{X \in A\}$ and $\{Y \in B\}$ are independent. That is:

$$P(X \in A, Y \in B) = P(X \in A)P(Y \in B)$$

**Properties under Independence** (assuming the relevant expectations and variances are finite):
1. $E[XY] = E[X]E[Y]$
2. $Cov(X, Y) = 0$
3. $Var(X + Y) = Var(X) + Var(Y)$
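These identities can be confirmed exactly on a small example. The sketch assumes two independent fair dice, modeled as the uniform measure on the 36 ordered pairs; `Z = X + Y` is added as a deliberately dependent companion of `X` for contrast.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # (first die, second die)
P = Fraction(1, 36)                               # uniform measure on pairs

def E(f):
    """Expectation of f(a, b) under the uniform measure."""
    return sum(P * f(a, b) for a, b in outcomes)

def cov(f, g):
    """Cov(f, g) = E[fg] - E[f]E[g]."""
    return E(lambda a, b: f(a, b) * g(a, b)) - E(f) * E(g)

X = lambda a, b: a
Y = lambda a, b: b
Z = lambda a, b: a + b

print(cov(X, Y))                            # 0
print(cov(Z, Z) == cov(X, X) + cov(Y, Y))   # True
print(cov(X, Z))                            # 35/12
```

The last line shows a nonzero covariance: `X` and `Z` share the first die, so they are not independent.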

---

### **`Def. 2.11` Marginal Distributions**:

Given a joint distribution (PDF $f_{X,Y}$ or PMF $p_{X,Y}$) of two random variables $X$ and $Y$, the **Marginal Distribution** of one variable is obtained by "summing out" or "integrating out" the other variable.

1. **Marginal PMF (Discrete)**: 
$$p_X(x) = \sum_{y \in S_y} p_{X,Y}(x, y)$$
2. **Marginal PDF (Continuous)**: 
$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) dy$$

> **Note**: The marginal distribution provides the probability behavior of $X$ ignoring all information about $Y$.
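In the discrete case, "summing out" is a one-pass accumulation over the joint table. The joint PMF below is a made-up example on $\{0,1\}^2$.

```python
from collections import defaultdict
from fractions import Fraction as Fr

# Hypothetical joint PMF p_{X,Y}(x, y), keyed by (x, y).
joint = {(0, 0): Fr(1, 4), (0, 1): Fr(1, 4), (1, 0): Fr(1, 8), (1, 1): Fr(3, 8)}

def marginal(joint, axis):
    """Sum the joint PMF over the other coordinate (axis 0 -> X, axis 1 -> Y)."""
    out = defaultdict(Fr)
    for pair, p in joint.items():
        out[pair[axis]] += p
    return dict(out)

p_X = marginal(joint, 0)
p_Y = marginal(joint, 1)

print(p_X)
print(p_Y)
```

Here $p_X(0)=p_X(1)=\tfrac12$, while $p_Y(0)=\tfrac38$ and $p_Y(1)=\tfrac58$.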


---

### **`Def. 2.12` Random Variable (The Source)**:

A **Random Variable (RV)** is a numerical description of the outcome of a statistical experiment. It maps outcomes from a sample space to real numbers.
*   **Discrete RV**: Values you can count (e.g., $0, 1, 2$).
*   **Continuous RV**: Values in a range (e.g., $1.75\dots$ meters).

---

### **`Def. 2.13` Probability Mass Function (PMF) - [Discrete Only]**:

For discrete random variables, the **PMF** gives the probability that the variable is exactly equal to some value.

$$p_X(x) = P(X = x)$$

**Key Rules**:
1. All probabilities are between 0 and 1.
2. The sum of all probabilities must equal 1: $\sum p_X(x) = 1$.

---

### **`Def. 2.14` Probability Density Function (PDF) - [Continuous Only]**:

For continuous random variables, the probability of hitting a *single exact point* is zero ($P(X=1.5) = 0$). Instead, we use the **PDF** to find probability over an **interval**.

$$P(a \le X \le b) = \int_{a}^{b} f_X(x) dx$$

**Key Rules**:
1. $f_X(x) \ge 0$ (Density cannot be negative).
2. The total area under the curve must be 1: $\int_{-\infty}^{\infty} f_X(x) dx = 1$.

---

### **`Def. 2.15` Cumulative Distribution Function (CDF) - [Universal]**:

The **CDF** is the most powerful tool because it applies to *both* discrete and continuous variables. It calculates the "running total" of probability up to a point $x$.

$$F_X(x) = P(X \le x)$$

**The Relationship "The Calculus of Stats"**:
*   **To get CDF from PDF**: Integrate. $F_X(x) = \int_{-\infty}^{x} f_X(t) dt$.
*   **To get PDF from CDF**: Differentiate. $f_X(x) = \frac{d}{dx} F_X(x)$ (wherever $F_X$ is differentiable).

---

### **`Def. 2.16` Joint, Marginal, and Conditional Distributions**:

When studying two variables ($X$ and $Y$) simultaneously:

1.  **Joint Distribution**: Describes how $X$ and $Y$ behave together, that is, how probability is spread over pairs $(x,y)$.
    *   $f_{X,Y}(x,y)$
2.  **Marginal Distribution**: The distribution of just $X$, ignoring $Y$.
    *   Find it by "collapsing" $Y$: $f_X(x) = \int f_{X,Y}(x,y) dy$.
3.  **Conditional Distribution**: The distribution of $X$ *given* that $Y$ is known to take the value $y$ (defined where $f_Y(y)>0$).
    *   $f_{X|Y}(x|y) = \frac{f_{X,Y}(x,y)}{f_Y(y)}$
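The discrete analogue of the conditional formula, $p_{X|Y}(x\mid y)=p_{X,Y}(x,y)/p_Y(y)$, can be sketched as follows. The joint PMF is a made-up example and the function name is illustrative.

```python
from fractions import Fraction as Fr

# Hypothetical joint PMF p_{X,Y}(x, y), keyed by (x, y).
joint = {(0, 0): Fr(1, 4), (0, 1): Fr(1, 4), (1, 0): Fr(1, 8), (1, 1): Fr(3, 8)}

def conditional_given_y(joint, y):
    """Return p_{X|Y}(. | y) = p_{X,Y}(., y) / p_Y(y); requires p_Y(y) > 0."""
    p_y = sum(p for (_, yy), p in joint.items() if yy == y)
    if p_y == 0:
        raise ValueError("conditioning on an event of probability zero")
    return {x: p / p_y for (x, yy), p in joint.items() if yy == y}

cond = conditional_given_y(joint, 1)
print(cond[0], cond[1])   # 2/5 3/5
```

Dividing by the marginal renormalizes the selected column of the joint table so that it sums to 1.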

---

### **Summary for Memorization**:

| Term | Symbol | Question it Answers | Data Type |
| :--- | :--- | :--- | :--- |
| **PMF** | $p(x)$ | "What is the chance of exactly $x$?" | Discrete |
| **PDF** | $f(x)$ | "How dense is the probability near $x$?" | Continuous |
| **CDF** | $F(x)$ | "What is the chance of getting $x$ or less?" | Both |
| **Joint** | $f(x,y)$ | "What is the chance of $x$ AND $y$?" | Both |


---

### **`Def. 2.17` Moments of a Random Variable**:

Moments are a set of statistical measures used to describe the shape of a distribution.
1.  **1st Moment**: $E[X]$ (Mean) - Location of the center.
2.  **2nd Central Moment**: $E[(X-\mu)^2]$ (Variance) - Width of the spread.
3.  **3rd Standardized Moment**: **Skewness**, $\mu_3/\sigma^3$ - Measures asymmetry (leaning left or right).
4.  **4th Standardized Moment**: **Kurtosis**, $\mu_4/\sigma^4$ - Measures "tailedness" (how heavy the tails are).

---

### **`Def. 2.18` Law of the Unconscious Statistician (LOTUS)**:

A critical theorem used to calculate the expected value of a **function** of a random variable without needing to find the distribution of $g(X)$ first:

$$E[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x) dx \quad \text{(Continuous)}$$
$$E[g(X)] = \sum_{x} g(x) p_X(x) \quad \text{(Discrete)}$$
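
A minimal sketch of the discrete formula, assuming a fair six-sided die and the function $g(x) = (x-3)^2$ (both chosen here only for illustration). It computes $E[g(X)]$ twice: once with LOTUS, and once the long way by first building the distribution of $Y = g(X)$.

```python
from collections import defaultdict
from fractions import Fraction

die = {x: Fraction(1, 6) for x in range(1, 7)}  # PMF of a fair die

def g(x):
    """Illustrative transformation; not injective (g(2) == g(4))."""
    return (x - 3) ** 2

# LOTUS: weight g(x) by the PMF of X, never touching the distribution of g(X).
expectation_lotus = sum(g(x) * p for x, p in die.items())

# The long way: first derive the PMF of Y = g(X), then take its mean.
pmf_y = defaultdict(Fraction)
for x, p in die.items():
    pmf_y[g(x)] += p
expectation_direct = sum(y * p for y, p in pmf_y.items())
```

Both routes give the same value; LOTUS just skips the bookkeeping of merging outcomes that map to the same $g(x)$.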

---

### **`Def. 2.19` Moment Generating Function (MGF)**:

The **MGF** is a functional "DNA" of a random variable. If two variables have MGFs that are finite and equal on an open interval around $t = 0$, they have the same distribution. It is defined as:

$$M_X(t) = E[e^{tX}]$$

**Why it matters**: 
To find the $n$-th moment, take the $n$-th derivative of $M_X(t)$ and evaluate it at $t=0$ (this requires $M_X$ to be finite in a neighborhood of $0$):
$$E[X^n] = M_X^{(n)}(0)$$
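
As a worked illustration (the exponential distribution is chosen here only as an example), take $X \sim Exp(\lambda)$ with density $f_X(x) = \lambda e^{-\lambda x}$ for $x \ge 0$. For $t < \lambda$:

$$M_X(t) = \int_0^\infty e^{tx} \lambda e^{-\lambda x} dx = \frac{\lambda}{\lambda - t}$$

$$M_X'(t) = \frac{\lambda}{(\lambda - t)^2} \quad \Rightarrow \quad E[X] = M_X'(0) = \frac{1}{\lambda}$$

$$M_X''(t) = \frac{2\lambda}{(\lambda - t)^3} \quad \Rightarrow \quad E[X^2] = M_X''(0) = \frac{2}{\lambda^2}$$

Combining the two gives $Var(X) = E[X^2] - (E[X])^2 = \frac{1}{\lambda^2}$.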

---

### **`Def. 2.20` Correlation ($\rho$)**:

While **Covariance** tells you the direction of a relationship, it is scale-dependent. **Correlation** is the "standardized" version that scales the relationship between $-1$ and $+1$.

$$\rho_{X,Y} = \text{Corr}(X,Y) = \frac{Cov(X,Y)}{\sigma_X \sigma_Y}$$

*   **$\rho = 1$**: Perfect positive linear relationship.
*   **$\rho = -1$**: Perfect negative linear relationship.
*   **$\rho = 0$**: No linear relationship (the variables can still be dependent in a nonlinear way).
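
The scale-dependence of covariance, and the lack of it in correlation, can be seen on a tiny synthetic data set. The sketch below treats five made-up pairs as equally likely outcomes; the values and function names are hypothetical, for illustration only.

```python
import math

def covariance(xs, ys):
    """Covariance of two equally weighted lists of paired values."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

xs = [1.0, 2.0, 3.0, 4.0, 5.0]  # synthetic values
ys = [2.0, 1.0, 4.0, 3.0, 5.0]  # synthetic values
xs_rescaled = [100.0 * x for x in xs]  # same data in different units

cov_original = covariance(xs, ys)
cov_rescaled = covariance(xs_rescaled, ys)    # grows by the factor 100
rho_original = correlation(xs, ys)
rho_rescaled = correlation(xs_rescaled, ys)   # unchanged by the rescaling
rho_linear = correlation(xs, [2.0 * x + 1.0 for x in xs])  # exact line
```

Changing units multiplies the covariance by 100 but leaves $\rho$ untouched, and an exact increasing linear relationship yields $\rho = 1$ up to floating-point rounding.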

---

### **`Def. 2.21` Standard Deviation ($\sigma$)**:

The **Standard Deviation** is simply the square root of the variance. We use it because it returns the "spread" measurement back to the **original units** of the data.

$$\sigma = \sqrt{Var(X)}$$

---

### **Summary of Important Inequalities**:

1.  **Markov's Inequality**: Upper-bounds the probability that a non-negative RV is at least a constant $a > 0$.
    $$P(X \ge a) \le \frac{E[X]}{a}$$
2.  **Chebyshev's Inequality**: Bounds the probability that an RV with finite variance lands $k$ or more standard deviations from its mean, for any $k > 0$ (often used as a conservative rule for outlier detection).
    $$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}$$
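
Both bounds can be verified exactly on a small example. The sketch below assumes a fair six-sided die (an illustrative choice) and compares the true tail probabilities with the two bounds for a handful of thresholds.

```python
import math
from fractions import Fraction

die = {x: Fraction(1, 6) for x in range(1, 7)}  # PMF of a fair die
mean = sum(x * p for x, p in die.items())
std = math.sqrt(sum((x - mean) ** 2 * p for x, p in die.items()))

def tail_prob(a):
    """Exact P(X >= a)."""
    return sum(p for x, p in die.items() if x >= a)

def deviation_prob(k):
    """Exact P(|X - mean| >= k * std)."""
    return sum(p for x, p in die.items() if abs(x - mean) >= k * std)

# Markov: P(X >= a) <= E[X] / a for each tested a > 0.
markov_holds = all(tail_prob(a) <= mean / a for a in range(1, 8))
# Chebyshev: P(|X - mean| >= k * std) <= 1 / k^2 for each tested k > 0.
chebyshev_holds = all(deviation_prob(k) <= 1 / k ** 2 for k in (0.5, 1.0, 1.4, 2.0))
```

The bounds are usually loose: for example $P(X \ge 5) = 1/3$, while Markov only guarantees at most $3.5/5 = 0.7$.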


# `Third Lecture`: Advanced Topics in Random Variables

---

### **`Topic 3.1` Families of Distributions**:

**Families of Distributions** are standardized probability models that describe common real-world phenomena. They have pre-defined parameters that determine their shape, center, and spread.

#### A. Discrete Families

| Name | Notation | Parameters | Common Use |
| :--- | :--- | :--- | :--- |
| **Bernoulli** | $Bern(p)$ | $p \in [0, 1]$ | Single coin flip outcome (Success/Failure). |
| **Binomial** | $Bin(n, p)$ | $n \in \mathbb{N}, p \in [0, 1]$ | Count of successes in $n$ independent trials. |
| **Poisson** | $Pois(\lambda)$ | $\lambda > 0$ | Count of rare events over a fixed time/space interval (e.g., calls per hour). |

#### B. Continuous Families

| Name | Notation | Parameters | Common Use |
| :--- | :--- | :--- | :--- |
| **Uniform** | $U(a, b)$ | $a, b \in \mathbb{R}$, $a < b$ | Values equally likely anywhere within an interval. |
| **Exponential** | $Exp(\lambda)$ | $\lambda > 0$ | Time until the next event occurs (memoryless property). |
| **Normal (Gaussian)** | $N(\mu, \sigma^2)$ | $\mu \in \mathbb{R}, \sigma^2 > 0$ | Limit distribution in the Central Limit Theorem; common model for quantities built from many small effects (heights, measurement errors). |
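
For reference, the standard PMF and PDF formulas behind these families can be written in a few lines. This is a minimal sketch using only Python's standard library; the function names are hypothetical, and Bernoulli is covered as the Binomial case $n = 1$.

```python
import math

def binomial_pmf(k, n, p):
    """P(K = k) for K ~ Bin(n, p); Bernoulli(p) is the case n = 1."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(K = k) for K ~ Pois(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def uniform_pdf(x, a, b):
    """Density of U(a, b), assuming a < b."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def exponential_pdf(x, lam):
    """Density of Exp(lam)."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def normal_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2), parameterized by the variance sigma2."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# Sanity checks: a PMF sums to 1, and the Binomial mean equals n * p.
binomial_total = sum(binomial_pmf(k, 10, 0.3) for k in range(11))
binomial_mean = sum(k * binomial_pmf(k, 10, 0.3) for k in range(11))
poisson_total = sum(poisson_pmf(k, 3.0) for k in range(60))  # truncated sum
```

The Poisson sum is truncated at 60 terms, which is more than enough for $\lambda = 3$ because the remaining terms are negligibly small.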

---

### **`Topic 3.2` Statistics**:

In this context, a **Statistic** is any function of the observable data (a sample) that does not depend on unknown population parameters. It is itself a random variable.

| Term | Definition | Example |
| :--- | :--- | :--- |
| **Statistic** | A quantity calculated from sample data alone, without reference to unknown parameters. | Sample Mean ($\bar{x}$), Sample Variance ($s^2$). |
| **Parameter** | A fixed, unknown value describing the entire population. | Population Mean ($\mu$), Population Variance ($\sigma^2$). |
| **Estimator** | A specific type of statistic used to estimate a parameter. | $\bar{x}$ is an estimator for $\mu$. |
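| **Bias** | The difference between the expected value of an estimator and the true parameter value: $Bias(\hat{\theta}) = E[\hat{\theta}] - \theta$. | $\bar{x}$ is unbiased for $\mu$: $E[\bar{x}] - \mu = 0$. |
| **Efficiency** | Commonly measured by the inverse of the estimator's variance (lower variance means higher efficiency). | For normally distributed data, $\bar{x}$ is more efficient than the sample median. |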
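
Bias can be computed exactly for a small case. The sketch below assumes samples of size 2 drawn with replacement from a fair six-sided die (true variance $35/12$) and enumerates all 36 equally likely samples to compare two variance estimators: dividing by $n$ versus dividing by $n-1$. The setup and names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
true_variance = Fraction(35, 12)  # variance of a single fair die roll

def sum_sq_dev(sample):
    """Sum of squared deviations of the sample from its own mean."""
    sample_mean = Fraction(sum(sample), len(sample))
    return sum((x - sample_mean) ** 2 for x in sample)

# All 36 equally likely samples of size n = 2.
samples = list(product(faces, repeat=2))

# Expected value of each estimator, averaged over every possible sample.
expected_div_n = sum(sum_sq_dev(s) / len(s) for s in samples) / len(samples)
expected_div_n_minus_1 = sum(sum_sq_dev(s) / (len(s) - 1) for s in samples) / len(samples)

bias_div_n = expected_div_n - true_variance                   # nonzero: biased
bias_div_n_minus_1 = expected_div_n_minus_1 - true_variance   # zero: unbiased
```

Dividing by $n$ underestimates the variance on average, while dividing by $n-1$ removes the bias exactly in this setting.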

---

### **`Topic 3.3` Entropy ($H$)**:

**Entropy** in information theory measures the average level of "information," "surprise," or uncertainty inherent in a random variable's possible outcomes.

*   **Higher Entropy**: More uncertainty/randomness (e.g., flipping a fair coin).
*   **Lower Entropy**: Less uncertainty/more predictability (e.g., a coin weighted to land on heads 99% of the time).

#### Definitions:
*   **Discrete Entropy (Shannon Entropy)**: 
    $$H(X) = -E[\log P(X)] = -\sum_{x} P(x) \log P(x)$$
*   **Differential Entropy (Continuous)**:
    $$H(X) = -\int_{-\infty}^{\infty} f_X(x) \log f_X(x) dx$$
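
The two coin examples above can be made concrete. A minimal sketch of the discrete formula, assuming base-2 logarithms so that entropy is measured in bits (the base is a convention; the function name is hypothetical):

```python
import math

def shannon_entropy(probs, base=2.0):
    """Shannon entropy -sum p * log(p); zero-probability terms contribute 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

fair_coin_entropy = shannon_entropy([0.5, 0.5])      # maximal for two outcomes
biased_coin_entropy = shannon_entropy([0.99, 0.01])  # close to 0: predictable
certain_entropy = shannon_entropy([1.0, 0.0])        # no uncertainty at all
```

The fair coin carries 1 bit of uncertainty, the 99% coin roughly 0.08 bits, and a coin that always lands the same way carries none.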

---

### **`Topic 3.4` Independent Random Variables, Marginals**:

These topics link back to concepts covered in the previous lecture notes but emphasize their role in multivariate analysis:

#### A. Independence Review
$X$ and $Y$ are independent ($X \perp Y$) if, for all events $A$ and $B$, their joint probability factors into the product of their individual (marginal) probabilities:
$$P(X \in A, Y \in B) = P(X \in A)P(Y \in B)$$

#### B. Marginal Distributions Review
The marginal distribution of one variable is derived from a joint distribution by integrating or summing over all possible values of the other variable:

$$f_X(x) = \int f_{X,Y}(x, y) dy$$

**Key Takeaway**: If you know $X$ and $Y$ are independent, the joint distribution is fully determined by the two marginals, because $f_{X,Y}(x,y) = f_X(x) \cdot f_Y(y)$.
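
The factorization gives a direct test for independence in the discrete case: compute both marginals and check whether their product reproduces every entry of the joint table. The sketch below assumes two illustrative joint PMFs, two separate fair dice and a die paired with a copy of itself; all names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def marginals(joint):
    """Return the marginal PMFs of X and Y by summing out the other variable."""
    pmf_x, pmf_y = {}, {}
    for (x, y), p in joint.items():
        pmf_x[x] = pmf_x.get(x, 0) + p
        pmf_y[y] = pmf_y.get(y, 0) + p
    return pmf_x, pmf_y

def is_independent(joint):
    """True if joint(x, y) == pmf_x(x) * pmf_y(y) for every pair (x, y)."""
    pmf_x, pmf_y = marginals(joint)
    return all(
        joint.get((x, y), 0) == pmf_x[x] * pmf_y[y]
        for x, y in product(pmf_x, pmf_y)
    )

# Two separate fair dice: every pair has probability 1/36.
two_dice = {(x, y): Fraction(1, 36) for x, y in product(range(1, 7), repeat=2)}
# One die copied: Y always equals X, so only the diagonal has mass.
copied_die = {(x, x): Fraction(1, 6) for x in range(1, 7)}

dice_independent = is_independent(two_dice)
copy_independent = is_independent(copied_die)
```

Both tables have identical uniform marginals, yet only the first factors, so marginals alone do not determine the joint unless independence is known.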

---

### **`Topic 3.5` Characteristic Function (CF) and Moment-Generating Function (MGF)**:

These are alternative ways to characterize a distribution fully. The **Characteristic Function ($\phi_X(t)$)** is often preferred theoretically because it *always* exists (since $|e^{itX}| = 1$, the expectation is always finite), unlike the MGF.

*   **Moment-Generating Function (MGF)**:
    $$M_X(t) = E[e^{tX}]$$
    Used to find moments: $E[X^n] = M_X^{(n)}(0)$, provided $M_X$ is finite in a neighborhood of $t = 0$.
*   **Characteristic Function (CF)**:
    $$\phi_X(t) = E[e^{itX}]$$
    Involves imaginary numbers ($i^2 = -1$). It uniquely determines the distribution and is the main tool in standard proofs of the Central Limit Theorem.
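
A minimal numerical sketch of the characteristic function, assuming a fair six-sided die as the example distribution (an illustrative choice; names are hypothetical). It checks that $\phi_X(0) = 1$, that $|\phi_X(t)| \le 1$, and that the derivative at $0$ equals $i \, E[X]$.

```python
import cmath

die = {x: 1 / 6 for x in range(1, 7)}  # PMF of a fair die

def char_fn(t):
    """Characteristic function E[exp(i t X)] of the die."""
    return sum(p * cmath.exp(1j * t * x) for x, p in die.items())

value_at_zero = char_fn(0.0)  # equals 1 up to floating-point rounding
max_modulus = max(abs(char_fn(t / 10)) for t in range(-50, 51))

# phi'(0) = i * E[X]; approximate the derivative by a central difference
# and divide by i to recover the mean.
h = 1e-5
mean_from_cf = ((char_fn(h) - char_fn(-h)) / (2 * h) / 1j).real
```

The recovered mean is approximately $3.5$, the same value obtained from the PMF directly.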
