## Expectation

- Central tendency of distribution X
    - Mean(X) == E(X)
    - Median X: Middle of X
    - Mode X: Most common value X
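    - Right skew X: typically Mean > Median (the long right tail pulls the mean up)
    - Left skew X: typically Mean < Median (the long left tail pulls the mean down)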
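
As a rough illustration of these three measures, here is a minimal sketch using Python's standard `statistics` module. The sample values are made up for the example; they are chosen so that one large value creates a right skew.

```python
import statistics

# Hypothetical right-skewed sample: most values are small, one is large.
sample = [1, 2, 2, 3, 4, 5, 20]

mean = statistics.mean(sample)      # arithmetic average
median = statistics.median(sample)  # middle value of the sorted sample
mode = statistics.mode(sample)      # most frequent value

print(mean, median, mode)
```

Here the single large value drags the mean above the median, which matches the right-skew rule of thumb.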

- Expectation of a random variable is just the mean, but what if you want the expectation of a *function* of a random variable?
    - if $E[X] = x_1 p(x_1) + x_2 p(x_2) + x_3 p(x_3)$, where $p(x_i)$ is the probability that $X = x_i$
    - then $E[f(X)] = f(x_1) p(x_1) + f(x_2) p(x_2) + f(x_3) p(x_3)$
    - e.g. Let X be the value from one roll of a fair die
        - $E[X] = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + 3 \cdot \frac{1}{6} + 4 \cdot \frac{1}{6} + 5 \cdot \frac{1}{6} + 6 \cdot \frac{1}{6} = 3.5$
        - $E[X^2] = 1^2 \cdot \frac{1}{6} + 2^2 \cdot \frac{1}{6} + 3^2 \cdot \frac{1}{6} + 4^2 \cdot \frac{1}{6} + 5^2 \cdot \frac{1}{6} + 6^2 \cdot \frac{1}{6} = \frac{91}{6} \approx 15.17$, which is not the same as $(E[X])^2 = 12.25$
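
The die example can be reproduced exactly with a small helper. This is a sketch, not a library API: the function name `expectation` and the dictionary representation of the probability mass function are choices made for this example.

```python
from fractions import Fraction

def expectation(pmf, f=lambda x: x):
    """Return E[f(X)] for a discrete pmf given as {value: probability}."""
    return sum(f(x) * p for x, p in pmf.items())

# Fair six-sided die: each face has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}

e_x = expectation(die)                     # E[X]
e_x2 = expectation(die, lambda x: x ** 2)  # E[X^2]

print(e_x, e_x2)
```

Using `Fraction` keeps the arithmetic exact, so the results can be compared directly with the hand calculation.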

- Sum of expectations (linearity of expectation): $E[X_1 + X_2] = E[X_1] + E[X_2]$, which holds even when $X_1$ and $X_2$ are not independent
    - e.g. Suppose 3 people are each assigned a distinct index number (1, 2, 3). You don't know who has which number, so you have to guess. Let $X$ be the number of people whose number you guess correctly. What is $E[X]$?
        - We know that $E[X] = E[X_1] + E[X_2] + E[X_3]$, where $X_i$ is an indicator that is 1 when the guess for person $i$ is right, and 0 when it is wrong
        - $E[X_i] = P(X_i = 1) = 1/3$ for every $i$, because you are guessing at random
        - So $E[X] = 3 \cdot \frac{1}{3} = 1$
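
The guessing example is small enough to enumerate. The sketch below assumes, as one reasonable reading of "randomly guessing", that a guess hands out the numbers 1, 2, 3 in a uniformly random order. The names `num_correct` and `per_person` are illustrative.

```python
from fractions import Fraction
from itertools import permutations

people = (1, 2, 3)  # true index numbers of the three people

# Assumption (not stated in the notes): a guess is a uniformly random
# ordering of the numbers 1, 2, 3, i.e. a random permutation.
guesses = list(permutations(people))

def num_correct(guess):
    """X: how many people were given their true number."""
    return sum(g == t for g, t in zip(guess, people))

# E[X], averaging X over all equally likely guesses.
expected_correct = Fraction(sum(num_correct(g) for g in guesses), len(guesses))

# E[X_i] for each person i: the fraction of guesses that get person i right.
per_person = [
    Fraction(sum(g[i] == people[i] for g in guesses), len(guesses))
    for i in range(len(people))
]

print(expected_correct, sum(per_person))
```

Under this assumption the indicators are not independent (getting two people right forces the third to be right), yet the per-person expectations still add up to $E[X]$.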


## Variance

- Compare 2 coin flip games:
    - Game 1: heads, you win $1; tails, you lose $1
    - Game 2: heads, you win $100; tails, you lose $100
    - Expected value is 0 for both, but the variances are very different
        - $\frac{(1-0)^2 + (-1-0)^2} {2} = 1$
        - $\frac{(100-0)^2 + (-100-0)^2} {2} = 10000$
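
The two games can be compared numerically. A minimal sketch, treating each game as two equally likely outcomes (win or lose the stake); the helper name `mean_and_variance` is made up for this example.

```python
def mean_and_variance(outcomes):
    """Mean and variance of a list of equally likely outcomes."""
    mu = sum(outcomes) / len(outcomes)
    var = sum((x - mu) ** 2 for x in outcomes) / len(outcomes)
    return mu, var

small_game = mean_and_variance([1, -1])      # win or lose $1
big_game = mean_and_variance([100, -100])    # win or lose $100

print(small_game, big_game)
```

Both games have the same mean, so only the variance tells them apart.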

- Variance, with $\mu = E[X]$
$$\begin{align}
    Var(X) &= E[(X - \mu)^2] \\
    &= E[X^2 - 2X\mu + \mu^2] \\
    &= E[X^2] - 2\mu E[X] + E[\mu^2] \\
    &= E[X^2] - 2\mu E[X] + \mu^2 \\
    &= E[X^2] - 2(E[X])^2 + (E[X])^2 \\
    &= E[X^2] - (E[X])^2
\end{align}$$
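
As a quick check of the identity, the sketch below computes the variance of a fair six-sided die both ways, from the definition $E[(X - \mu)^2]$ and from the shortcut $E[X^2] - (E[X])^2$. The die is just a convenient test case.

```python
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)  # probability of each face of a fair die

mu = sum(x * p for x in faces)                            # E[X]
var_definition = sum((x - mu) ** 2 * p for x in faces)    # E[(X - mu)^2]
var_shortcut = sum(x ** 2 * p for x in faces) - mu ** 2   # E[X^2] - (E[X])^2

print(var_definition, var_shortcut)
```

Both expressions give the same exact fraction, as the derivation predicts.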

- Standard Deviation
    - Square root of the variance: $\sigma = \sqrt{Var(X)}$
    - If X is measured in meters, Var(X) is in meters squared, so it is not directly comparable to X
    - The standard deviation is back in the same units as X (meters), so it can be compared with X directly

## Sum of gaussians

- Assume $X \sim N(\mu_x, \sigma_x^2)$, $Y \sim N(\mu_y, \sigma_y^2)$
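- **If** X and Y are independent, then $X+Y \sim N(\mu_x + \mu_y, \sigma_x^2 + \sigma_y^2)$
    - Independence is sufficient here, not necessary: in general $Var(X+Y) = \sigma_x^2 + \sigma_y^2 + 2\,Cov(X, Y)$, so the variances add whenever X and Y are uncorrelated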
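
A simulation can make the mean and variance of the sum concrete. This is a sketch with illustrative parameter values and a fixed seed; it only checks the first two moments of $X + Y$, not that the sum is itself Gaussian.

```python
import random
from statistics import fmean

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 100_000

# Illustrative parameters, not taken from the notes.
mu_x, sigma_x = 1.0, 2.0
mu_y, sigma_y = 3.0, 1.5

# Draw X and Y independently, then add them.
sums = [rng.gauss(mu_x, sigma_x) + rng.gauss(mu_y, sigma_y) for _ in range(n)]

sample_mean = fmean(sums)
sample_var = fmean([(s - sample_mean) ** 2 for s in sums])

print(sample_mean, mu_x + mu_y)
print(sample_var, sigma_x ** 2 + sigma_y ** 2)
```

With these parameters the sample mean should land near 4 and the sample variance near 6.25, up to simulation noise.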

## Standardisation

- For many applications of ML, standardising a variable makes it less prone to computational issues
- Let Z be the standardised random variable from X
    - Usual way is $Z = \frac{X - \mu}{\sigma}$
- Proof that $E[Z] = 0$
    - $\begin{align}
        E[Z] &= E\left[\frac{X - \mu}{\sigma}\right] \\
        &= \frac{1}{\sigma}\left(E[X] - E[\mu]\right) \\
        &= \frac{1}{\sigma}\left(E[X] - E[X]\right) \\
        &= 0
    \end{align}$
- Proof that $Var(Z) = 1$
    - $\begin{align}
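        Var(Z) &= Var\left(\frac{X - \mu}{\sigma}\right) \\
        &= Var\left(\frac{X}{\sigma} - \frac{\mu}{\sigma}\right) \\
        &= Var\left(\frac{X}{\sigma}\right) \quad \text{(subtracting the constant } \mu/\sigma \text{ does not change the variance)} \\
        &= \frac{1}{\sigma^2}Var(X) \\
        &= \frac{\sigma^2}{\sigma^2} = 1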
    \end{align}$
    - $\begin{align}
        std(Z) &= \sqrt{\frac{1}{\sigma^2}Var(X)} \\
        &= \frac{1}{\sigma} std(X) \\
        &= \frac{\sigma}{\sigma} \\
        &= 1
    \end{align}$
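
The two proofs can be checked on a concrete sample. A minimal sketch, using made-up data and the population standard deviation (`statistics.pstdev`) so that the sample plays the role of the whole distribution:

```python
from statistics import fmean, pstdev

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative sample

mu = fmean(data)
sigma = pstdev(data)  # population standard deviation of the sample

# Standardise: subtract the mean, divide by the standard deviation.
z = [(x - mu) / sigma for x in data]

z_mean = fmean(z)
z_std = pstdev(z)

print(z_mean, z_std)
```

Up to floating-point rounding, the standardised values have mean 0 and standard deviation 1.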

## Skewness/Kurtosis

- Skewness: 3rd standardised moment
    - Definition: $E\left[\left(\frac{X - \mu}{\sigma}\right)^3\right]$
    - If skewness > 0, right skew (positive skew)
    - If skewness < 0, left skew (negative skew)
    - If skewness == 0, no skew
- Kurtosis: 4th standardised moment
    - Definition: $E\left[\left(\frac{X - \mu}{\sigma}\right)^4\right]$
    - If kurtosis is large, thick tails in distribution
    - If kurtosis is small, thin tails in distribution
    - Reference point: a normal distribution has kurtosis 3
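
Both definitions are the same expression with a different power, which a single helper can show. This is a sketch: `standardised_moment` is a name made up for this example, the samples are invented, and the sample mean and population standard deviation stand in for $\mu$ and $\sigma$.

```python
from statistics import fmean, pstdev

def standardised_moment(data, k):
    """Sample estimate of E[((X - mu) / sigma)^k]."""
    mu = fmean(data)
    sigma = pstdev(data)
    return fmean([((x - mu) / sigma) ** k for x in data])

symmetric = [1, 2, 3, 4, 5]        # evenly spread around its mean
right_skewed = [1, 1, 1, 2, 10]    # one large value forms a right tail
left_skewed = [-x for x in right_skewed]  # mirror image of the sample above

skew_symmetric = standardised_moment(symmetric, 3)
skew_right = standardised_moment(right_skewed, 3)
skew_left = standardised_moment(left_skewed, 3)
kurt_symmetric = standardised_moment(symmetric, 4)

print(skew_symmetric, skew_right, skew_left, kurt_symmetric)
```

Mirroring a sample flips the sign of its skewness but leaves the kurtosis unchanged, since the fourth power removes the sign.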

## Kernel density estimation

- A histogram of a set of values is usually not smooth: it is constant within each bin and jumps at every bin edge, so its shape depends on where the bins are placed
- To get a smooth estimate, we can place a **kernel** on each observation instead of counting observations in bins
    1. On the X-axis, mark each observed value with a vertical line
    2. Centre a density (the kernel) on each observed value (use Gaussian for convenience)
    3. Sum all the densities and multiply by $1/n$, where $n$ is the number of observations you have
    4. Each kernel integrates to 1, so the factor $1/n$ makes the total area under the kernel density estimate equal to 1
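
The steps above can be sketched in a few lines. The observations are made up, and the `bandwidth` parameter (the standard deviation of each Gaussian kernel) is an assumption of this example, since the notes do not say how wide the kernel should be. The area is approximated with a simple Riemann sum.

```python
import math

def gaussian_kernel(u):
    """Standard normal density evaluated at u."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, observations, bandwidth):
    """Kernel density estimate at x: the average of one Gaussian per observation."""
    n = len(observations)
    total = sum(gaussian_kernel((x - xi) / bandwidth) for xi in observations)
    return total / (n * bandwidth)

observations = [1.0, 1.5, 2.0, 4.0, 4.2]  # illustrative data
bandwidth = 0.5                           # assumed kernel width

# Approximate the area under the estimate on a grid from -5 to 10.
step = 0.01
grid = [-5.0 + i * step for i in range(1501)]
area = sum(kde(x, observations, bandwidth) for x in grid) * step

print(kde(1.5, observations, bandwidth), kde(3.0, observations, bandwidth), area)
```

The estimate is higher near the cluster of observations around 1.5 than in the gap around 3, and the area comes out very close to 1. In practice the choice of bandwidth controls how smooth the estimate looks.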