## 4. Expectation

### 4.1 Expectation of a Random Variable

The **expected value**, **mean** or **first moment** of $X$ is defined to be

$$ \mathbb{E}(X) = \int x \; dF(x) = \begin{cases}
\sum_x x f(x) &\text{if } X \text{ is discrete} \\
\int x f(x)\; dx &\text{if } X \text{ is continuous}
\end{cases} $$

assuming that the sum (or integral) is well-defined.  We use the following notation to denote the expected value of $X$:

$$ \mathbb{E}(X) = \mathbb{E}X = \int x\; dF(x) = \mu = \mu_X $$

The expectation is a one-number summary of the distribution.  Think of $\mathbb{E}(X)$ as the average value you'd obtain if you computed the numeric average $n^{-1} \sum_{i=1}^n X_i$ for a large number of IID draws $X_1, \dots, X_n$.  The fact that $\mathbb{E}(X) \approx n^{-1} \sum_{i=1}^n X_i$ is a theorem called the law of large numbers, which we will discuss later.  We use $\int x \; dF(x)$ as a convenient notation that unifies the discrete case $\sum_x x f(x)$ and the continuous case $\int x f(x) \; dx$, but note that $\int x \; dF(x)$ also has a precise meaning, which is discussed in real analysis courses.
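
As a rough illustration of the two views of $\mathbb{E}(X)$ described above, the sketch below computes $\sum_x x f(x)$ exactly for a fair six-sided die and compares it with the average of simulated draws.  The die, the seed, and the sample size are illustrative assumptions, not part of the text.

```python
import random
from fractions import Fraction

# Fair six-sided die (illustrative choice): f(x) = 1/6 for x = 1, ..., 6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

# Discrete case of the definition: E(X) = sum_x x f(x).
expected_value = sum(x * p for x, p in pmf.items())

# Average of n simulated IID draws; the seed is fixed only for reproducibility.
rng = random.Random(0)
n = 100_000
sample_average = sum(rng.randint(1, 6) for _ in range(n)) / n

print(expected_value)   # 7/2
print(sample_average)   # close to 3.5, but not exactly equal
```

The exact value uses `Fraction` so that no rounding enters the definition; only the simulated average is approximate.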

To ensure that $\mathbb{E}(X)$ is well defined, we say that $\mathbb{E}(X)$ exists if $\int |x| \; dF_X(x) < \infty$.  Otherwise we say that the expectation does not exist.  From now on, whenever we discuss expectations, we implicitly assume they exist.

**Theorem 4.6 (The rule of the lazy statistician)**.  Let $Y = r(X)$.  Then

$$ \mathbb{E}(Y) = \mathbb{E}(r(X)) = \int r(x) \; dF_X(x) $$

As a special case, let $A$ be an event and let $r(x) = I_A(x)$, where $I_A(x) = 1$ if $x \in A$ and $I_A(x) = 0$ otherwise.  Then

$$ \mathbb{E}(I_A(X)) = \int I_A(x) f_X(x) dx = \int_A f_X(x) dx = \mathbb{P}(X \in A) $$

In other words, probability is a special case of expectation.
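
A minimal sketch of Theorem 4.6 in the discrete case, assuming a fair die as the distribution of $X$: the helper `expectation` (a hypothetical name) evaluates $\sum_x r(x) f(x)$ without ever deriving the distribution of $Y = r(X)$, and taking $r = I_A$ recovers $\mathbb{P}(X \in A)$.

```python
from fractions import Fraction

# Fair six-sided die (illustrative choice).
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expectation(r, pmf):
    """E(r(X)) = sum_x r(x) f(x) for a discrete X with mass function pmf."""
    return sum(r(x) * p for x, p in pmf.items())

# Second moment E(X^2), computed directly from the distribution of X.
second_moment = expectation(lambda x: x ** 2, pmf)

# Indicator of the event A = {5, 6}: its expectation is P(X in A).
A = {5, 6}
prob_via_indicator = expectation(lambda x: 1 if x in A else 0, pmf)
prob_direct = sum(p for x, p in pmf.items() if x in A)

print(second_moment)                     # 91/6
print(prob_via_indicator, prob_direct)   # 1/3 1/3
```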

Functions of several variables are handled in a similar way.  If $Z = r(X, Y)$ then

$$ \mathbb{E}(Z) = \mathbb{E}(r(X, Y)) = \int \int r(x, y) \; dF(x, y) $$

The **$k$-th moment** of $X$ is defined to be $\mathbb{E}(X^k)$, assuming that $\mathbb{E}(|X|^k) < \infty$.  We shall rarely make much use of moments beyond $k = 2$.

### 4.2 Properties of Expectations

**Theorem 4.10**.  If $X_1, \dots, X_n$ are random variables and $a_1, \dots, a_n$ are constants, then

$$ \mathbb{E}\left( \sum_i a_i X_i \right) = \sum_i a_i \mathbb{E}(X_i) $$

**Theorem 4.12**.  Let $X_1, \dots, X_n$ be independent random variables.  Then,

$$ \mathbb{E}\left(\prod_i X_i \right) = \prod_i \mathbb{E}(X_i) $$

Notice that the summation rule does not require independence, but the product rule does.
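
The contrast between the two theorems can be checked on a small example.  The sketch below assumes two hypothetical joint distributions for a pair of Bernoulli$(1/2)$ variables, one with $Y = X$ and one with $X$ and $Y$ independent, and evaluates both sides of each rule exactly.

```python
from fractions import Fraction

half = Fraction(1, 2)

# Joint mass functions on pairs (x, y); both have Bernoulli(1/2) marginals.
dependent = {(0, 0): half, (1, 1): half}                      # Y = X
independent = {(x, y): half * half for x in (0, 1) for y in (0, 1)}

def expect(r, joint):
    """E(r(X, Y)) = sum over (x, y) of r(x, y) f(x, y)."""
    return sum(r(x, y) * p for (x, y), p in joint.items())

results = {}
for name, joint in (("dependent", dependent), ("independent", independent)):
    ex = expect(lambda x, y: x, joint)
    ey = expect(lambda x, y: y, joint)
    results[name] = {
        "sum_rule_holds": expect(lambda x, y: x + y, joint) == ex + ey,
        "product_rule_holds": expect(lambda x, y: x * y, joint) == ex * ey,
    }

print(results["dependent"])     # sum rule holds, product rule fails
print(results["independent"])   # both rules hold
```

In the dependent case $\mathbb{E}(XY) = \mathbb{E}(X^2) = 1/2$ while $\mathbb{E}(X)\mathbb{E}(Y) = 1/4$, which is why the product rule fails there.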

### 4.3 Variance and Covariance

Let $X$ be a random variable with mean $\mu$.  The **variance** of $X$ -- denoted by $\sigma^2$ or $\sigma_X^2$ or $\mathbb{V}(X)$ or $\mathbb{V}X$ -- is defined by

$$ \sigma^2 = \mathbb{E}\left[(X - \mu)^2\right] = \int (x - \mu)^2\; dF(x) $$

assuming this expectation exists.  The **standard deviation** is $\text{sd}(X) = \sqrt{\mathbb{V}(X)}$ and is also denoted by $\sigma$ and $\sigma_X$.

**Theorem 4.14**.  Assuming the variance is well defined, it has the following properties:

1.  $\mathbb{V}(X) = \mathbb{E}(X^2) - \left(\mathbb{E}(X)\right)^2$
2.  If $a$ and $b$ are constants then $\mathbb{V}(aX + b) = a^2 \mathbb{V}(X)$
3.  If $X_1, \dots, X_n$ are independent and $a_1, \dots, a_n$ are constants then

    $$ \mathbb{V}\left( \sum_{i=1}^n a_iX_i \right) = \sum_{i=1}^n a_i^2 \mathbb{V}(X_i) $$

If $X_1, \dots, X_n$ are random variables then we define the **sample mean** to be

$$ \overline{X}_n = \frac{1}{n} \sum_{i=1}^n X_i  $$

and the **sample variance** to be

$$ S_n^2 = \frac{1}{n - 1} \sum_{i=1}^n \left(X_i - \overline{X}_n\right)^2 $$

**Theorem 4.16**.  Let $X_1, \dots, X_n$ be IID and let $\mu = \mathbb{E}(X_i)$, $\sigma^2 = \mathbb{V}(X_i)$.  Then

$$ 
\mathbb{E}\left(\overline{X}_n\right) = \mu,
\quad
\mathbb{V}\left(\overline{X}_n\right) = \frac{\sigma^2}{n},
\quad \text{and} \quad
\mathbb{E}\left(S_n^2\right) = \sigma^2
$$
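
Theorem 4.16 can be verified exactly on a small case by enumerating every possible sample.  The sketch below assumes an arbitrary three-point distribution and $n = 3$; both are illustrative choices.

```python
from fractions import Fraction
from itertools import product

# Arbitrary three-point distribution (illustrative choice) and sample size.
pmf = {0: Fraction(1, 2), 1: Fraction(1, 4), 4: Fraction(1, 4)}
n = 3

mu = sum(x * p for x, p in pmf.items())
sigma2 = sum((x - mu) ** 2 * p for x, p in pmf.items())

# Enumerate all IID samples (x_1, ..., x_n) together with their probabilities.
e_mean = e_mean_sq = e_s2 = Fraction(0)
for sample in product(pmf, repeat=n):
    prob = Fraction(1)
    for x in sample:
        prob *= pmf[x]
    xbar = Fraction(sum(sample), n)                          # sample mean
    s2 = sum((x - xbar) ** 2 for x in sample) / (n - 1)      # sample variance
    e_mean += prob * xbar
    e_mean_sq += prob * xbar ** 2
    e_s2 += prob * s2

v_mean = e_mean_sq - e_mean ** 2

print(e_mean == mu)            # True
print(v_mean == sigma2 / n)    # True
print(e_s2 == sigma2)          # True
```

The $n - 1$ divisor in $S_n^2$ is what makes the last comparison hold; dividing by $n$ instead would give $(n-1)\sigma^2/n$.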

If $X$ and $Y$ are random variables, then the covariance and correlation between $X$ and $Y$ measure how strong the linear relationship between $X$ and $Y$ is.

Let $X$ and $Y$ be random variables with means $\mu_X$ and $\mu_Y$ and standard deviations $\sigma_X$ and $\sigma_Y$.  Define the **covariance** between $X$ and $Y$ by

$$ \text{Cov}(X, Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)] $$

and the **correlation** by

$$ \rho = \rho_{X, Y} = \rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} $$

**Theorem 4.18**.  The covariance satisfies:

$$ \text{Cov}(X, Y) = \mathbb{E}(XY) - \mathbb{E}(X) \mathbb{E}(Y) $$

The correlation satisfies:

$$ -1 \leq \rho(X, Y) \leq 1 $$

If $Y = a + bX$ for some constants $a$ and $b$ then $\rho(X, Y) = 1$ if $b > 0$ and $\rho(X, Y) = -1$ if $b < 0$.  If $X$ and $Y$ are independent, then $\text{Cov}(X, Y) = \rho = 0$.  The converse is not true in general.
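
A standard counterexample shows why zero covariance does not imply independence.  The sketch below assumes $X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$ (an illustrative choice): the covariance is exactly zero even though $Y$ is a function of $X$.

```python
from fractions import Fraction

third = Fraction(1, 3)

# X is uniform on {-1, 0, 1} and Y = X^2, so Y is completely determined by X.
joint = {(-1, 1): third, (0, 0): third, (1, 1): third}

def expect(r):
    """E(r(X, Y)) under the joint mass function above."""
    return sum(r(x, y) * p for (x, y), p in joint.items())

# Cov(X, Y) = E(XY) - E(X) E(Y), as in Theorem 4.18.
cov_xy = expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)

# Independence would require P(X = 0, Y = 0) = P(X = 0) P(Y = 0).
p_joint = joint[(0, 0)]
p_x0 = sum(p for (x, _), p in joint.items() if x == 0)
p_y0 = sum(p for (_, y), p in joint.items() if y == 0)

print(cov_xy)                 # 0
print(p_joint, p_x0 * p_y0)   # 1/3 1/9
```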

**Theorem 4.19**.

$$ 
\mathbb{V}(X + Y) = \mathbb{V}(X) + \mathbb{V}(Y) + 2 \text{Cov}(X, Y)
\quad \text{ and } \quad
\mathbb{V}(X - Y) = \mathbb{V}(X) + \mathbb{V}(Y) - 2 \text{Cov}(X, Y)
$$

More generally, for random variables $X_1, \dots, X_n$,

$$ \mathbb{V}\left( \sum_i a_i X_i \right) = \sum_i a_i^2 \mathbb{V}(X_i) + 2 \sum \sum_{i < j} a_i a_j \text{Cov}(X_i, X_j) $$

### 4.4 Expectation and Variance of Important Random Variables

$$
\begin{array}{lll}
\text{Distribution} & \text{Mean} & \text{Variance}           \\
\hline
\text{Point mass at } a      & a             & 0              \\
\text{Bernoulli}(p)          & p             & p(1-p)         \\
\text{Binomial}(n, p)        & np            & np(1-p)        \\
\text{Geometric}(p)          & 1/p           & (1 - p)/p^2    \\
\text{Poisson}(\lambda)      & \lambda       & \lambda        \\
\text{Uniform}(a, b)         & (a + b) / 2   & (b - a)^2 / 12 \\
\text{Normal}(\mu, \sigma^2) & \mu           & \sigma^2       \\
\text{Exponential}(\beta)    & \beta         & \beta^2        \\
\text{Gamma}(\alpha, \beta)  & \alpha \beta  & \alpha \beta^2 \\
\text{Beta}(\alpha, \beta)   & \alpha / (\alpha + \beta) & \alpha \beta / ((\alpha + \beta)^2 (\alpha + \beta + 1)) \\
t_\nu                        & 0 \text{ (if } \nu > 1 \text{)} & \nu / (\nu - 2) \text{ (if } \nu > 2 \text{)} \\
\chi^2_p                     & p             & 2p             \\
\text{Multinomial}(n, p)     & np            & \text{see below} \\
\text{Multivariate Normal}(\mu, \Sigma) & \mu & \Sigma \\
\end{array}
$$
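
Individual rows of the table can be checked directly from the definitions.  As one example, the sketch below sums over the Binomial$(n, p)$ mass function with the illustrative values $n = 10$ and $p = 3/10$ and compares the results with $np$ and $np(1-p)$.

```python
from fractions import Fraction
from math import comb

# Illustrative parameter values.
n, p = 10, Fraction(3, 10)

# Binomial mass function: f(k) = C(n, k) p^k (1 - p)^(n - k).
pmf = {k: comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)}

mean = sum(k * q for k, q in pmf.items())
variance = sum(k ** 2 * q for k, q in pmf.items()) - mean ** 2

print(mean, variance)   # 3 21/10
```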

The last two entries in the table are multivariate models which involve a random vector $X$ of the form

$$ X = \begin{pmatrix} X_1 \\ \vdots \\ X_k \end{pmatrix} $$

The mean of a random vector $X$ is defined by

$$ \mu = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_k \end{pmatrix} = \begin{pmatrix} \mathbb{E}(X_1) \\ \vdots \\ \mathbb{E}(X_k) \end{pmatrix} $$

The **variance-covariance matrix** $\Sigma$ is defined to be

$$ \Sigma = \begin{pmatrix}
\mathbb{V}(X_1) & \text{Cov}(X_1, X_2) & \cdots & \text{Cov}(X_1, X_k) \\
\text{Cov}(X_2, X_1) & \mathbb{V}(X_2) & \cdots & \text{Cov}(X_2, X_k) \\
\vdots & \vdots & \ddots & \vdots \\
\text{Cov}(X_k, X_1) & \text{Cov}(X_k, X_2) & \cdots & \mathbb{V}(X_k)
\end{pmatrix} $$

If $X \sim \text{Multinomial}(n, p)$ then

$$ 
\mathbb{E}(X) = np = n(p_1, \dots, p_k)
\quad \text{and} \quad
\mathbb{V}(X) = \begin{pmatrix}
np_1(1 - p_1) & -np_1p_2 & \cdots & -np_1p_k \\
-np_2p_1 & np_2(1 - p_2) & \cdots & -np_2p_k \\
\vdots & \vdots & \ddots & \vdots \\
-np_kp_1 & -np_kp_2 & \cdots & np_k(1 - p_k)
\end{pmatrix} $$

To see this:

- Note that the marginal distribution of any one component is $X_i \sim \text{Binomial}(n, p_i)$, so $\mathbb{E}(X_i) = np_i$ and $\mathbb{V}(X_i) = np_i(1 - p_i)$.  
- Note that, for $i \neq j$, $X_i + X_j \sim \text{Binomial}(n, p_i + p_j)$, so $\mathbb{V}(X_i + X_j) = n(p_i + p_j)(1 - (p_i + p_j))$.
- Using the formula for the variance of a sum, for $i \neq j$,

$$ \mathbb{V}(X_i + X_j) = \mathbb{V}(X_i) + \mathbb{V}(X_j) + 2 \text{Cov}(X_i, X_j) =  np_i(1 - p_i) + np_j(1 - p_j) + 2 \text{Cov}(X_i, X_j) $$

Equating the last two formulas we get a formula for the covariance, $\text{Cov}(X_i, X_j) = -np_ip_j$.
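
The resulting variance-covariance matrix can be confirmed by brute force on a small case.  The sketch below assumes $n = 4$ and $p = (1/2, 1/3, 1/6)$ (illustrative values), enumerates the multinomial mass function, and compares every entry with $np_i(1 - p_i)$ on the diagonal and $-np_ip_j$ off it.

```python
from fractions import Fraction
from math import factorial

# Illustrative parameter values.
n = 4
p = (Fraction(1, 2), Fraction(1, 3), Fraction(1, 6))

# Multinomial mass function over all count vectors (x1, x2, x3) summing to n.
pmf = {}
for x1 in range(n + 1):
    for x2 in range(n + 1 - x1):
        x3 = n - x1 - x2
        coef = factorial(n) // (factorial(x1) * factorial(x2) * factorial(x3))
        pmf[(x1, x2, x3)] = coef * p[0] ** x1 * p[1] ** x2 * p[2] ** x3

def expect(r):
    """E(r(X)) for the random vector X with the mass function above."""
    return sum(r(x) * q for x, q in pmf.items())

def cov(i, j):
    """Cov(X_i, X_j) = E(X_i X_j) - E(X_i) E(X_j); equals V(X_i) when i == j."""
    return expect(lambda x: x[i] * x[j]) - expect(lambda x: x[i]) * expect(lambda x: x[j])

matrix_matches = all(
    cov(i, j) == (n * p[i] * (1 - p[i]) if i == j else -n * p[i] * p[j])
    for i in range(3) for j in range(3)
)

print(cov(0, 1))        # -2/3
print(matrix_matches)   # True
```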

Finally, here's a lemma that can be useful for finding means and variances of linear combinations of multivariate random vectors.

**Lemma 4.20**.  If $a$ is a vector and $X$ is a random vector with mean $\mu$ and variance $\Sigma$ then

$$ \mathbb{E}(a^T X) = a^T \mu
\quad \text{and} \quad
\mathbb{V}(a^T X) = a^T \Sigma a $$

If $A$ is a matrix then

$$ \mathbb{E}(A X) = A \mu
\quad \text{and} \quad
\mathbb{V}(AX) = A \Sigma A^T $$
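
A minimal check of the first half of Lemma 4.20, assuming a small hypothetical joint distribution for $X = (X_1, X_2)$ and the vector $a = (2, -1)$.  The mean and variance of $a^T X$ are computed once directly from the distribution and once from $\mu$ and $\Sigma$.

```python
from fractions import Fraction

# Hypothetical joint mass function for X = (X1, X2) and coefficient vector a.
joint = {(0, 0): Fraction(1, 4), (1, 0): Fraction(1, 4), (1, 2): Fraction(1, 2)}
a = (2, -1)
k = len(a)

def expect(r):
    """E(r(X)) under the joint mass function above."""
    return sum(r(x) * p for x, p in joint.items())

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# Mean vector and variance-covariance matrix of X.
mu = [expect(lambda x, i=i: x[i]) for i in range(k)]
sigma = [[expect(lambda x, i=i, j=j: x[i] * x[j]) - mu[i] * mu[j]
          for j in range(k)] for i in range(k)]

# Left-hand sides: moments of the scalar a^T X computed directly.
mean_direct = expect(lambda x: dot(a, x))
var_direct = expect(lambda x: dot(a, x) ** 2) - mean_direct ** 2

# Right-hand sides: a^T mu and a^T Sigma a.
mean_lemma = dot(a, mu)
var_lemma = sum(a[i] * sigma[i][j] * a[j] for i in range(k) for j in range(k))

print(mean_direct == mean_lemma, var_direct == var_lemma)   # True True
```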

### 4.5 Conditional Expectation

The conditional expectation of $X$ given $Y = y$ is

$$ \mathbb{E}(X | Y = y) = \begin{cases}
\sum_x x f_{X | Y}(x | y) &\text{ discrete case} \\
\int x f_{X | Y}(x | y) dx &\text{ continuous case}
\end{cases}
$$

If $r$ is a function of $x$ and $y$ then

$$ \mathbb{E}(r(X, Y) | Y = y) = \begin{cases}
\sum_x r(x, y) f_{X | Y}(x | y) &\text{ discrete case} \\
\int r(x, y) f_{X | Y}(x | y) dx &\text{ continuous case}
\end{cases}
$$

While $\mathbb{E}(X)$ is a number, $\mathbb{E}(X | Y = y)$ is a function of $y$.  Before we observe $Y$, we don't know the value of $\mathbb{E}(X | Y = y)$ so it is a random variable which we denote $\mathbb{E}(X | Y)$.  In other words, $\mathbb{E}(X | Y)$ is the random variable whose value is $\mathbb{E}(X | Y = y)$ when $Y$ is observed as $y$.  Similarly, $\mathbb{E}(r(X, Y) | Y)$ is the random variable whose value is $\mathbb{E}(r(X, Y) | Y = y)$ when $Y$ is observed as $y$.

**Theorem 4.23 (The rule of iterated expectations)**.  For random variables $X$ and $Y$, assuming the expectations exist, we have that

$$ \mathbb{E}[\mathbb{E}(Y | X)] = \mathbb{E}(Y)
\quad \text{and} \quad
\mathbb{E}[\mathbb{E}(X | Y)] = \mathbb{E}(X) $$

More generally, for any function $r(x, y)$ we have

$$ \mathbb{E}[\mathbb{E}(r(X, Y) | X)] = \mathbb{E}(r(X, Y))
\quad \text{and} \quad
\mathbb{E}[\mathbb{E}(r(X, Y) | Y)] = \mathbb{E}(r(X, Y)) $$

**Proof**.  We will prove the first equation.

$$ 
\begin{align}
\mathbb{E}[\mathbb{E}(Y | X)] &= \int \mathbb{E}(Y | X = x) f_X(x) dx = \int \int y f(y | x) dy f(x) dx \\
&= \int \int y f(y|x) f(x) dx dy = \int \int y f(x, y) dx dy = \mathbb{E}(Y)
\end{align}
$$
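
The discrete analogue of this argument can be traced numerically.  The sketch below assumes a small hypothetical joint mass function, builds $\mathbb{E}(Y | X = x)$ from $f(y | x) = f(x, y) / f_X(x)$, and then averages it over $f_X$.

```python
from fractions import Fraction

# Hypothetical joint mass function f(x, y).
joint = {
    (0, 0): Fraction(1, 4), (0, 1): Fraction(1, 4),
    (1, 1): Fraction(1, 6), (1, 2): Fraction(1, 3),
}

# Marginal mass function f_X(x) = sum_y f(x, y).
f_X = {}
for (x, y), p in joint.items():
    f_X[x] = f_X.get(x, 0) + p

def cond_mean_y(x0):
    """E(Y | X = x0) = sum_y y f(y | x0), with f(y | x0) = f(x0, y) / f_X(x0)."""
    return sum(y * p / f_X[x0] for (x, y), p in joint.items() if x == x0)

# E[E(Y | X)]: average the conditional means over the distribution of X.
lhs = sum(cond_mean_y(x) * px for x, px in f_X.items())

# E(Y) computed directly from the joint distribution.
rhs = sum(y * p for (x, y), p in joint.items())

print(lhs, rhs)   # 13/12 13/12
```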

The **conditional variance** is defined as

$$ \mathbb{V}(Y | X = x) = \int (y - \mu(x))^2 f(y | x) dy $$

where $\mu(x) = \mathbb{E}(Y | X = x)$.

**Theorem 4.26**.  For random variables $X$ and $Y$,

$$ \mathbb{V}(Y) = \mathbb{E}\mathbb{V}(Y | X) + \mathbb{V} \mathbb{E} (Y | X)$$
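
Theorem 4.26 can be checked the same way in the discrete case.  The sketch below assumes a small hypothetical joint mass function and computes the two terms $\mathbb{E}\mathbb{V}(Y | X)$ and $\mathbb{V}\mathbb{E}(Y | X)$ separately before comparing their sum with $\mathbb{V}(Y)$.

```python
from fractions import Fraction

# Hypothetical joint mass function f(x, y).
joint = {
    (0, 0): Fraction(1, 4), (0, 1): Fraction(1, 4),
    (1, 1): Fraction(1, 6), (1, 2): Fraction(1, 3),
}

f_X = {}
for (x, y), p in joint.items():
    f_X[x] = f_X.get(x, 0) + p

def cond_mean(x0):
    """mu(x0) = E(Y | X = x0)."""
    return sum(y * p / f_X[x0] for (x, y), p in joint.items() if x == x0)

def cond_var(x0):
    """V(Y | X = x0) = sum_y (y - mu(x0))^2 f(y | x0)."""
    m = cond_mean(x0)
    return sum((y - m) ** 2 * p / f_X[x0] for (x, y), p in joint.items() if x == x0)

# E V(Y | X): average of the conditional variances.
e_of_v = sum(cond_var(x) * px for x, px in f_X.items())

# V E(Y | X): variance of the conditional means.
e_of_e = sum(cond_mean(x) * px for x, px in f_X.items())
v_of_e = sum((cond_mean(x) - e_of_e) ** 2 * px for x, px in f_X.items())

# V(Y) computed directly from the joint distribution.
e_y = sum(y * p for (x, y), p in joint.items())
v_y = sum((y - e_y) ** 2 * p for (x, y), p in joint.items())

print(e_of_v + v_of_e, v_y)   # 83/144 83/144
```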

### 4.6 Technical Appendix

#### 4.6.1 Expectation as an Integral

The integral of a measurable function $r(x)$ is defined as follows.  First suppose that $r$ is simple, meaning that it takes finitely many values $a_1, \dots, a_k$ over a partition $A_1, \dots, A_k$, with $r(x) = a_i$ for $x \in A_i$.  Then $\int r(x) dF(x) = \sum_{i=1}^k a_i \mathbb{P}(X \in A_i)$.  The integral of a nonnegative measurable function $r$ is defined by $\int r(x) dF(x) = \lim_i \int r_i(x) dF(x)$, where $r_i$ is a sequence of nonnegative simple functions such that $r_i(x) \leq r(x)$ and $r_i(x) \rightarrow r(x)$ as $i \rightarrow \infty$.  This does not depend on the particular sequence.  The integral of a measurable function $r$ is defined to be $\int r(x) dF(x) = \int r^+(x) dF(x) - \int r^-(x) dF(x)$ assuming both integrals are finite, where $r^+(x) = \max \{ r(x), 0 \}$ and $r^-(x) = -\min\{ r(x), 0 \}$.

#### 4.6.2  Moment Generating Functions

The **moment generating function (mgf)** or **Laplace transform** of $X$ is defined by

$$ \psi_X(t) = \mathbb{E}(e^{tX}) = \int e^{tx} dF(x) $$

where $t$ varies over the real numbers.
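
As a small illustration of the definition, the sketch below evaluates $\psi_X(t) = \sum_x e^{tx} f(x)$ for a Bernoulli$(p)$ variable with the illustrative value $p = 0.3$.  It also estimates the slope of $\psi_X$ at $t = 0$ by a central difference; that this slope equals $\mathbb{E}(X)$ is a standard property of moment generating functions that is not stated in this section and is assumed here.

```python
import math

# Illustrative Bernoulli parameter.
p = 0.3
pmf = {0: 1 - p, 1: p}

def mgf(t):
    """psi_X(t) = E(e^{tX}) = sum_x e^{tx} f(x) for the discrete pmf above."""
    return sum(math.exp(t * x) * q for x, q in pmf.items())

# Central-difference estimate of the derivative of psi_X at t = 0.
h = 1e-5
slope_at_zero = (mgf(h) - mgf(-h)) / (2 * h)

mean = sum(x * q for x, q in pmf.items())

print(mgf(0.0))              # 1.0, since E(e^0) = 1
print(slope_at_zero, mean)   # the two values agree to several decimal places
```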