<img src="./images/banner.png" width="800">

# Expectation, Variance, and Moments

Welcome to our deep dive into the world of expectation, variance, and moments! 🎢📊


In our previous lecture, we explored the foundations of random variables and probability distributions. Now, we're going to take it up a notch and learn how to describe and summarize these distributions using powerful mathematical tools.


In this notebook, we'll discuss the following concepts:
1. **Expected Value (Mean)**: The "average" outcome of a random variable
2. **Variance and Standard Deviation**: Measures of spread or dispersion
3. **Covariance and Correlation**: Understanding relationships between variables
4. **Moments**: A general way to characterize distributions


These concepts are crucial in data science and machine learning for several reasons:

- They help us summarize and understand large datasets
- They're fundamental to many statistical tests and models
- They're used in feature engineering and selection
- They're essential for understanding model performance and uncertainty


As we explore these topics, try to think about how they might apply to real-world scenarios. Whether you're analyzing stock prices, predicting customer behavior, or optimizing algorithms, these tools will be invaluable.


Before we dive in, let's activate our probability neurons:

1. If you roll a fair six-sided die, what number do you "expect" to get?
2. In a group of 100 people, would you be more surprised by everyone having the same birthday, or by no two people sharing a birthday?


Keep these questions in mind as we begin our journey into expectation and variance. Let's get started! 🚀

**Table of contents**<a id='toc0_'></a>    
- [Expected Value (Mean)](#toc1_)    
  - [Properties of Expectation](#toc1_1_)    
  - [Real-world Application](#toc1_2_)    
- [Variance and Standard Deviation](#toc2_)    
  - [Properties of Variance](#toc2_1_)    
  - [Examples and Calculations](#toc2_2_)    
  - [Real-world Application](#toc2_3_)    
- [Covariance and Correlation](#toc3_)    
  - [Covariance: Measuring Relationship](#toc3_1_)    
  - [Interpreting Covariance](#toc3_2_)    
  - [Correlation: Standardized Covariance](#toc3_3_)    
  - [Interpreting Correlation](#toc3_4_)    
  - [Properties and Examples](#toc3_5_)    
  - [Real-world Applications](#toc3_6_)    
- [Moments of Random Variables](#toc4_)    
  - [Central Moments](#toc4_1_)    
  - [Relationships to Other Statistical Measures](#toc4_2_)    
  - [Real-world Application](#toc4_3_)    
- [Conclusion](#toc5_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=2
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_'></a>[Expected Value (Mean)](#toc0_)

Imagine you're playing a game where you flip a coin. If it's heads, you win \$10, and if it's tails, you lose \$5. How much money should you expect to win (or lose) on average if you play this game many times? This is where the concept of expected value comes in handy.


The expected value, often called the mean, is the long-run average of a random variable: the value that the average outcome approaches as the experiment is repeated more and more times. It's a powerful tool that helps us understand the central tendency of a distribution.


For discrete random variables, we calculate the expected value by summing up each possible outcome multiplied by its probability:
- $E[X] = \sum_{x} x \cdot p(x)$


For continuous random variables, we use integration instead:
- $E[X] = \int_{-\infty}^{\infty} x \cdot f(x) dx$
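When the integral has no convenient closed form, it can be approximated numerically. The sketch below is one way to do that, assuming an exponential density with rate 2 (an illustrative choice, not something from this lecture) and a basic midpoint rule on a truncated range. For this density the exact mean is $1/2$, which gives us something to check against.

```python
import math

RATE = 2.0  # assumed rate parameter of the exponential density (illustrative)

def density(x):
    """Exponential pdf f(x) = RATE * exp(-RATE * x) for x >= 0."""
    return RATE * math.exp(-RATE * x)

def expected_value_continuous(pdf, lower, upper, steps=100_000):
    """Approximate the integral of x * pdf(x) over [lower, upper] with the midpoint rule."""
    width = (upper - lower) / steps
    total = 0.0
    for i in range(steps):
        x = lower + (i + 0.5) * width  # midpoint of the i-th slice
        total += x * pdf(x) * width
    return total

# The density is zero below 0 and negligible beyond x = 20, so we truncate there.
approx_mean = expected_value_continuous(density, 0.0, 20.0)
print(f"Approximate E[X]: {approx_mean:.4f}")
```

The truncation point and the number of slices are arbitrary; a heavier-tailed density would need a wider range or a proper quadrature routine.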


Let's go back to our coin flip game. Here's how we'd calculate the expected value:

- $E[X] = 10 \cdot P(\text{Heads}) + (-5) \cdot P(\text{Tails}) = 10 \cdot 0.5 + (-5) \cdot 0.5 = 2.5$

So, on average, you'd expect to win \$2.50 per game if you played many times.
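The same calculation can be sketched in a few lines of Python. This is a minimal illustration, not a library API: the helper name `expected_value`, the seed, and the number of simulated games are all arbitrary choices. It computes the exact expectation from the probability table and then checks it against a simulation.

```python
import random

# Outcomes of the coin flip game and their probabilities (from the text).
game = {10: 0.5, -5: 0.5}

def expected_value(pmf):
    """Return E[X] = sum of x * p(x) for a discrete distribution given as {x: p(x)}."""
    return sum(x * p for x, p in pmf.items())

exact = expected_value(game)

# Monte Carlo check: the average of many simulated games should be close to E[X].
rng = random.Random(0)  # fixed seed (arbitrary) so the run is reproducible
n_games = 100_000
simulated = sum(rng.choice([10, -5]) for _ in range(n_games)) / n_games

print(f"Exact E[X]:        {exact}")
print(f"Simulated average: {simulated:.3f}")
```

The simulated average will not match 2.5 exactly, but it should get closer as `n_games` grows.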


### <a id='toc1_1_'></a>[Properties of Expectation](#toc0_)


Expectation has some useful properties that make calculations easier:

1. **Linearity**: The expectation of a sum is the sum of the expectations, $E[X + Y] = E[X] + E[Y]$, and this holds even when the variables are not independent. For example, if you played the coin flip game twice, your expected winnings would be \$2.50 + \$2.50 = \$5.00.

2. **Scaling**: If you multiply a random variable by a constant, you multiply its expectation by that constant: $E[cX] = c \cdot E[X]$. If the coin flip game prizes were doubled, your expected winnings would be \$5.00 per game.
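Both properties can be checked on the coin flip game by building the relevant distributions explicitly. The sketch below assumes the two plays are independent, but only so that the joint probabilities can be written as products; linearity itself does not need that assumption. The helper names are illustrative.

```python
from itertools import product

game = {10: 0.5, -5: 0.5}  # coin flip game: {winnings: probability}

def expected_value(pmf):
    """E[X] for a discrete distribution given as {x: p(x)}."""
    return sum(x * p for x, p in pmf.items())

# Distribution of total winnings over two independent plays:
# enumerate every pair of outcomes and add up the probabilities.
two_games = {}
for (x1, p1), (x2, p2) in product(game.items(), repeat=2):
    two_games[x1 + x2] = two_games.get(x1 + x2, 0.0) + p1 * p2

# Distribution of a single play with doubled prizes.
doubled = {2 * x: p for x, p in game.items()}

print("E[X1 + X2] =", expected_value(two_games))  # equals E[X] + E[X]
print("E[2X]      =", expected_value(doubled))    # equals 2 * E[X]
```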


### <a id='toc1_2_'></a>[Real-world Application](#toc0_)

Expected values are crucial in many fields. Insurance companies use them to set premiums, investors use them to evaluate potential returns, and machine learning algorithms use them to make predictions.


For instance, in a recommendation system, we might calculate the expected rating a user would give to a movie based on their past ratings and the ratings of similar users.


Understanding expected values is the first step in describing probability distributions. In the next section, we'll explore how values can deviate from this average with variance and standard deviation. This will give us a more complete picture of the distribution's behavior.


Remember, while the expected value tells us the average outcome, individual results can (and often do) differ. It's a guide, not a guarantee!

## <a id='toc2_'></a>[Variance and Standard Deviation](#toc0_)

Imagine you're a weather forecaster. You know that the average temperature in your city during summer is 25°C (77°F). But does this tell the whole story? What if one day it's 15°C and the next it's 35°C? This is where variance and standard deviation come into play.


Variance measures how far the values of a random variable are spread out from their average: it is the average squared deviation from the mean. It gives us an idea of the variability in our data.


For a random variable X with expected value E[X], the variance is defined as:

- $Var(X) = E[(X - E[X])^2]$


For discrete random variables:
- $Var(X) = \sum_x (x - E[X])^2 \cdot p(x)$

For continuous random variables:
- $Var(X) = \int_{-\infty}^{\infty} (x - E[X])^2 \cdot f(x) dx$


The standard deviation is simply the square root of the variance:

- $\sigma = \sqrt{Var(X)}$


Standard deviation is often preferred because it's in the same units as the original data, whereas variance is in squared units. In our weather example, if the variance is 25 (°C)², the standard deviation would be 5°C, giving us a more intuitive measure of variability.


### <a id='toc2_1_'></a>[Properties of Variance](#toc0_)


1. **Non-negativity**: Variance is always non-negative. It's zero only when the random variable takes a single value with probability 1.

2. **Scaling**: If we multiply a random variable by a constant $c$, the variance is multiplied by $c^2$:
   $Var(cX) = c^2 Var(X)$

3. **Variance of a sum**: For independent random variables X and Y:
   $Var(X + Y) = Var(X) + Var(Y)$
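These properties can be verified exactly on a small example. The sketch below uses a fair six-sided die; the constant `c = 3` and the helper names are arbitrary choices for illustration. It compares both sides of the scaling rule and of the sum rule for two independent rolls.

```python
from itertools import product

def mean(pmf):
    """E[X] for a discrete distribution given as {x: p(x)}."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var(X) = E[(X - E[X])^2] for a discrete distribution."""
    mu = mean(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

die = {face: 1 / 6 for face in range(1, 7)}  # fair six-sided die

# Scaling: distribution of c * X.
c = 3
scaled = {c * x: p for x, p in die.items()}

# Sum of two independent rolls: joint probabilities are products.
total = {}
for (x, px), (y, py) in product(die.items(), repeat=2):
    total[x + y] = total.get(x + y, 0.0) + px * py

print(f"Var(X)        = {variance(die):.4f}")
print(f"Var(cX)       = {variance(scaled):.4f}   c^2 Var(X)  = {c**2 * variance(die):.4f}")
print(f"Var(X1 + X2)  = {variance(total):.4f}   2 Var(X)    = {2 * variance(die):.4f}")
```

The sum rule relies on independence: if the second "roll" were just a copy of the first, the total would be $2X$ and its variance would be $4\,Var(X)$ instead.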


### <a id='toc2_2_'></a>[Examples and Calculations](#toc0_)


Let's return to our coin flip game where heads wins \$10 and tails loses \$5.


We calculated E[X] = 2.5. Now let's find the variance:

- $Var(X) = (10 - 2.5)^2 \cdot 0.5 + (-5 - 2.5)^2 \cdot 0.5 = 56.25$


The standard deviation is $\sqrt{56.25} = 7.5$


This tells us that while we expect to win \$2.50 on average, individual games can deviate quite a bit from this average.
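Here is the same calculation as a short Python sketch. It also cross-checks the result with the standard identity $Var(X) = E[X^2] - (E[X])^2$, which is not used in the text above but gives the same number.

```python
import math

game = {10: 0.5, -5: 0.5}  # coin flip game: {winnings: probability}

# Mean, then variance from the definition E[(X - E[X])^2].
mu = sum(x * p for x, p in game.items())
var = sum((x - mu) ** 2 * p for x, p in game.items())
sd = math.sqrt(var)

# Cross-check with the shortcut E[X^2] - (E[X])^2.
second_moment = sum(x ** 2 * p for x, p in game.items())
var_shortcut = second_moment - mu ** 2

print(f"E[X] = {mu}, Var(X) = {var}, SD = {sd}")
print(f"Shortcut variance = {var_shortcut}")
```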


### <a id='toc2_3_'></a>[Real-world Application](#toc0_)


Variance and standard deviation are crucial in many fields:

1. In finance, they're used to measure risk. A higher standard deviation in stock returns indicates higher volatility.

2. In quality control, they help determine if a manufacturing process is consistent.

3. In machine learning, they're used in feature scaling and in algorithms like Gaussian Processes.


Understanding variance and standard deviation gives us a more complete picture of our data's behavior. While the expected value tells us where the center of our distribution is, variance tells us how spread out it is.


In the next section, we'll explore how we can use these concepts to understand relationships between different variables through covariance and correlation. Stay tuned!

## <a id='toc3_'></a>[Covariance and Correlation](#toc0_)

Imagine you're analyzing ice cream sales and temperature data. You might notice that as temperature increases, so do ice cream sales. But how can we quantify this relationship? This is where covariance and correlation come in handy.


### <a id='toc3_1_'></a>[Covariance: Measuring Relationship](#toc0_)


Covariance measures how two variables change together. It tells us about the direction of the linear relationship between variables.


For random variables X and Y, covariance is defined as:

- $Cov(X,Y) = E[(X - E[X])(Y - E[Y])]$


In practice, for a sample of n pairs of values, we calculate it as:

- $Cov(X,Y) = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})$

Where $\bar{x}$ and $\bar{y}$ are the sample means.


### <a id='toc3_2_'></a>[Interpreting Covariance](#toc0_)


- Positive covariance: Variables tend to move in the same direction
- Negative covariance: Variables tend to move in opposite directions
- Zero covariance: No linear relationship between the variables


However, the magnitude of covariance is hard to interpret because it depends on the scales of the variables. This is where correlation comes in.


### <a id='toc3_3_'></a>[Correlation: Standardized Covariance](#toc0_)


The correlation coefficient standardizes covariance, giving us a measure that's easier to interpret. It's defined as:

- $\rho_{X,Y} = \frac{Cov(X,Y)}{\sigma_X \sigma_Y}$

Where $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y.


### <a id='toc3_4_'></a>[Interpreting Correlation](#toc0_)


- Correlation always falls between -1 and 1
- 1 indicates perfect positive linear relationship
- -1 indicates perfect negative linear relationship
- 0 indicates no linear relationship


For our ice cream example, a correlation of 0.8 between temperature and sales would indicate a strong positive relationship.


### <a id='toc3_5_'></a>[Properties and Examples](#toc0_)


1. **Symmetry**: $Cov(X,Y) = Cov(Y,X)$ and $\rho_{X,Y} = \rho_{Y,X}$

2. **Covariance with self**: $Cov(X,X) = Var(X)$

3. **Correlation with self**: $\rho_{X,X} = 1$

4. **Independence**: If X and Y are independent, $Cov(X,Y) = 0$, but the converse isn't always true
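A small counterexample makes the last property concrete. The sketch below assumes a hypothetical $X$ that takes the values -1, 0, and 1 with equal probability, and sets $Y = X^2$. Here $Y$ is completely determined by $X$, yet the covariance works out to zero. Exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction

# Joint distribution of (X, Y) with X uniform on {-1, 0, 1} and Y = X^2.
third = Fraction(1, 3)
joint = {(x, x * x): third for x in (-1, 0, 1)}

e_x = sum(x * pr for (x, y), pr in joint.items())
e_y = sum(y * pr for (x, y), pr in joint.items())
cov = sum((x - e_x) * (y - e_y) * pr for (x, y), pr in joint.items())

# Independence would require P(X=0, Y=0) = P(X=0) * P(Y=0).
p_x0 = sum(pr for (x, y), pr in joint.items() if x == 0)
p_y0 = sum(pr for (x, y), pr in joint.items() if y == 0)
p_both = joint.get((0, 0), Fraction(0))

print("Cov(X, Y)      =", cov)
print("P(X=0, Y=0)    =", p_both)
print("P(X=0)*P(Y=0)  =", p_x0 * p_y0)
```

The covariance is zero because the relationship is symmetric around zero, not linear, while the two probabilities on the last lines differ, so $X$ and $Y$ are not independent.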


**Example: Stock Returns**

Suppose we have two stocks, A and B. Their daily returns over a five-day trading week are:

- A: 1%, -2%, 3%, -1%, 2%
- B: 2%, -1%, 3%, -2%, 1%


Calculating with the sample formulas above (returns expressed as decimals, so 1% = 0.01), we find $Cov(A,B) \approx 0.00038$ and $\rho_{A,B} \approx 0.88$. This high positive correlation suggests these stocks tend to move together.
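The figures for this example can be recomputed directly from the five listed returns. The sketch below is a plain-Python version of the sample covariance (with the $n-1$ divisor) and the correlation coefficient; the function names are illustrative, and returns are written as decimals.

```python
import math

# Daily returns from the example, as decimals (1% = 0.01).
returns_a = [0.01, -0.02, 0.03, -0.01, 0.02]
returns_b = [0.02, -0.01, 0.03, -0.02, 0.01]

def sample_covariance(xs, ys):
    """Sample covariance with the n - 1 divisor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Correlation = covariance divided by the product of standard deviations."""
    sd_x = math.sqrt(sample_covariance(xs, xs))  # Cov(X, X) = Var(X)
    sd_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (sd_x * sd_y)

cov_ab = sample_covariance(returns_a, returns_b)
rho_ab = correlation(returns_a, returns_b)

print(f"Cov(A, B) = {cov_ab:.5f}")
print(f"rho(A, B) = {rho_ab:.2f}")
```

Note that the covariance depends on the units (percent versus decimals), while the correlation does not.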


### <a id='toc3_6_'></a>[Real-world Applications](#toc0_)


1. In finance, correlation between assets is crucial for portfolio diversification
2. In machine learning, feature selection often involves analyzing correlations
3. In meteorology, understanding correlations helps in weather prediction


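Remember, correlation doesn't imply causation. A strong correlation between temperature and ice cream sales doesn't tell us which way the effect runs: ice cream sales certainly don't cause high temperatures! In other cases, two variables are correlated only because both are affected by a common third factor.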


Understanding covariance and correlation allows us to quantify relationships between variables, a crucial skill in data analysis and modeling. In our next section, we'll explore higher-order moments, which give us even more tools to describe distributions. Stay tuned!

## <a id='toc4_'></a>[Moments of Random Variables](#toc0_)

Random variable moments are **important concepts in probability theory and statistics** that help describe the characteristics and shape of a probability distribution. They provide valuable information about the distribution's central tendency, spread, skewness, and other properties.

Imagine trying to describe the shape of a mountain to someone who's never seen it. You might start with its average height, then talk about how spread out it is, whether it's symmetrical, and how steep the slopes are. In the world of probability and statistics, moments play a similar role in describing the shape of probability distributions.


The most commonly used moments are:

1. **First Moment (Mean or Expected Value)**

    - Represents the *average* or *central value* of the distribution
    - Measure of **central tendency**
    - Formula:
      - Discrete: $E[X] = \sum_{x} x \cdot P(X = x)$
      - Continuous: $E[X] = \int_{-\infty}^{\infty} x \cdot f(x) dx$

2. **Second Central Moment (Variance)**

    - Measures the *spread* or *dispersion* of the distribution around the mean
    - Average **squared deviation** from the mean
    - Formula: $Var(X) = E[(X - E[X])^2]$

3. **Third Central Moment**

    - Used to calculate **skewness**, which measures the *asymmetry* of the distribution
    - Interpretation:
      - *Positive skew*: longer tail on the right side
      - *Negative skew*: longer tail on the left side

4. **Fourth Central Moment**

    - Used to calculate **kurtosis**, which measures the *"tailedness"* of the distribution, that is, how much probability sits in extreme values
    - Indicates whether the distribution has *heavier* or *lighter* tails compared to a normal distribution

> Higher-order moments (5th, 6th, etc.) exist but are less commonly used in practice.


The term "moment" comes from **physics**, where it refers to the product of a distance and a physical quantity. The statistical concept is analogous: a moment is the probability-weighted average of a power of the random variable.


Understanding moments is crucial because they provide a way to **characterize probability distributions** without needing to know the entire distribution function. They are particularly useful in *method of moments estimation*, a technique for estimating population parameters.

So moments are a set of quantities that provide information about a distribution's shape. The nth moment of a random variable X is defined as:

- $\mu_n = E[X^n]$


Let's break down the first four moments:

1. **First Moment (μ₁)**: This is simply the expected value, E[X]. It represents the center of the distribution.

2. **Second Moment (μ₂)**: E[X²]. This is related to the spread of the distribution: combined with the first moment, it gives the variance, $Var(X) = E[X^2] - (E[X])^2$.

3. **Third Moment (μ₃)**: E[X³]. This gives us information about the skewness of the distribution.

4. **Fourth Moment (μ₄)**: E[X⁴]. This is related to the kurtosis, or "tailedness", of the distribution.


### <a id='toc4_1_'></a>[Central Moments](#toc0_)


Central moments are calculated using deviations from the mean. The nth central moment is defined as:

- $\mu_n' = E[(X - E[X])^n]$


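The first central moment is always zero, because $E[X - E[X]] = E[X] - E[X] = 0$. The second central moment is the variance we discussed earlier. Central moments are often more useful for describing distribution shapes because they are unaffected by shifting the distribution left or right.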
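As a worked step (standard algebra using linearity of expectation, not spelled out elsewhere in this lecture), expanding the square shows how the second central moment relates to the raw moments defined above:

$$
\begin{aligned}
\mu_2' = E[(X - E[X])^2] &= E[X^2 - 2X\,E[X] + (E[X])^2] \\
&= E[X^2] - 2\,E[X]\,E[X] + (E[X])^2 \\
&= E[X^2] - (E[X])^2 = \mu_2 - \mu_1^2
\end{aligned}
$$

For the coin flip game, $E[X^2] = 100 \cdot 0.5 + 25 \cdot 0.5 = 62.5$, so the variance is $62.5 - 2.5^2 = 56.25$, matching the earlier calculation.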


### <a id='toc4_2_'></a>[Relationships to Other Statistical Measures](#toc0_)


1. **Mean**: First moment, μ₁ = E[X]

2. **Variance**: Second central moment, μ₂' = E[(X - E[X])²]

3. **Skewness**: Standardized third central moment
   $\gamma_1 = \frac{E[(X - E[X])^3]}{(E[(X - E[X])^2])^{3/2}}$

   - Positive skewness: right tail is longer
   - Negative skewness: left tail is longer
   - Zero skewness: expected for any symmetric distribution (though zero skewness alone does not guarantee symmetry)

4. **Excess kurtosis**: Standardized fourth central moment, minus 3 (the kurtosis of a normal distribution)
   $\gamma_2 = \frac{E[(X - E[X])^4]}{(E[(X - E[X])^2])^2} - 3$

   - Positive excess kurtosis: heavier tails than normal distribution
   - Negative excess kurtosis: lighter tails than normal distribution
   - Zero excess kurtosis: tail weight comparable to normal distribution
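The two formulas above translate directly into code for a discrete distribution. In the sketch below, `fair_game` is the coin flip game from earlier, while `biased_game` is a hypothetical variant (heads only 10% of the time) added purely to show a skewed case. The helper names are illustrative.

```python
def central_moment(pmf, n):
    """n-th central moment E[(X - E[X])^n] of a discrete distribution {x: p(x)}."""
    mean = sum(x * p for x, p in pmf.items())
    return sum((x - mean) ** n * p for x, p in pmf.items())

def skewness(pmf):
    """Standardized third central moment."""
    return central_moment(pmf, 3) / central_moment(pmf, 2) ** 1.5

def excess_kurtosis(pmf):
    """Standardized fourth central moment minus 3."""
    return central_moment(pmf, 4) / central_moment(pmf, 2) ** 2 - 3

fair_game = {10: 0.5, -5: 0.5}    # coin flip game from the lecture
biased_game = {10: 0.1, -5: 0.9}  # hypothetical: a rare big win, frequent small loss

for name, pmf in [("fair", fair_game), ("biased", biased_game)]:
    print(f"{name:>6}: skewness = {skewness(pmf):.3f}, "
          f"excess kurtosis = {excess_kurtosis(pmf):.3f}")
```

The fair game is symmetric around its mean, so its skewness is zero, and with only two equally likely outcomes it has no tails at all, which shows up as a strongly negative excess kurtosis. The biased game has a long right tail and therefore positive skewness.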


### <a id='toc4_3_'></a>[Real-world Application](#toc0_)


Understanding moments is crucial in many fields:

1. In finance, skewness and kurtosis are used to assess investment risk beyond just variance.

2. In manufacturing, moments help in quality control by characterizing the distribution of product measurements.

3. In machine learning, moments are used in method of moments estimation and in designing robust algorithms.


For example, in algorithmic trading, a strategy might avoid assets with strongly negative skewness (rare but extreme losses) if the goal is steady, predictable returns.


Moments provide a powerful set of tools for describing and analyzing probability distributions. They allow us to go beyond simple measures of center and spread, giving us a more complete picture of a distribution's shape.


## <a id='toc5_'></a>[Conclusion](#toc0_)

Let's recap the key concepts we've explored:

1. **Expected Value (Mean)**: We learned how to calculate the average outcome of a random variable, a fundamental concept in probability theory.

2. **Variance and Standard Deviation**: We discovered how to measure the spread of a distribution, giving us insight into the variability of outcomes.

3. **Covariance and Correlation**: We explored how to quantify relationships between variables, a crucial skill in data analysis and modeling.

4. **Moments**: We delved into higher-order moments, which provide a comprehensive description of a distribution's shape.


These concepts form the backbone of descriptive statistics and are fundamental to many areas of data science and machine learning. They allow us to:

- Summarize complex datasets succinctly
- Make predictions about future outcomes
- Assess relationships between variables
- Characterize the shape and behavior of probability distributions


As you move forward in your journey through probability and statistics, you'll find these concepts appearing again and again. They're essential tools in the data scientist's toolkit, used in everything from basic data analysis to advanced machine learning algorithms.


Remember, while these mathematical concepts might seem abstract, they have real-world applications in fields as diverse as finance, physics, social sciences, and technology. Whether you're predicting stock prices, analyzing experimental results, or building recommendation systems, the concepts we've covered today will be invaluable.


In our next lecture, we'll build on these foundations to explore more advanced topics in probability theory. Keep practicing these concepts, and don't hesitate to revisit this material as needed. The more comfortable you become with these ideas, the more powerful your data analysis skills will become.
