
## Exploration of variance and standard deviation
[From Wikipedia](https://en.wikipedia.org/wiki/Variance):

>There are two distinct concepts that are both called "variance". One, is part of a theoretical probability distribution and is defined by an equation. The other variance is a characteristic of a set of observations. When variance is calculated from observations, those observations are typically measured from a real-world system. If all possible observations of the system are present, then the calculated variance is called the population variance. Normally, however, only a subset is available, and the variance calculated from this is called the sample variance. The variance calculated from a sample is considered an estimate of the full population variance.
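The population/sample distinction in the quote maps onto the divisor used in the calculation. The sketch below is only an illustration: the array `obs` is made up, and it relies on NumPy's `ddof` argument, where `ddof=0` divides by $n$ (population variance) and `ddof=1` divides by $n - 1$ (the usual sample estimate).

```python
import numpy as np

# Hypothetical observations, chosen only for illustration
obs = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = obs.size

# Population variance: summed squared deviations divided by n (NumPy's default)
population_var = np.var(obs, ddof=0)
# Sample variance: summed squared deviations divided by n - 1
sample_var = np.var(obs, ddof=1)

print(f"Population variance (ddof=0): {population_var}")
print(f"Sample variance (ddof=1):     {sample_var}")
```

The two values differ by the factor $n / (n - 1)$, so the gap shrinks as the number of observations grows.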

---

### Variance
Variance measures how far a set of numbers are spread out from their average (mean). It quantifies the degree of variation or dispersion in a dataset.

- **Mathematical Definition**:  
    For a finite population of $N$ values, the variance $\sigma^2$ is calculated as:  
    $$
    \sigma^2 = \frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2
    $$  
    Where:
    - $N$: Number of data points
    - $x_i$: Each individual data point
    - $\mu$: population average of the quantity measured

    The variance of a random variable $X$ is the expected value of the squared deviation from the mean of $X$, $\mu = \operatorname{E}[X]$:

    $$
    \operatorname{Var}(X) = \operatorname{E}\left[(X - \mu)^2\right]
    $$

    This definition encompasses random variables that are generated by processes that are discrete, continuous, neither, or mixed. The variance can also be thought of as the covariance of a random variable with itself:

    $$
    \operatorname{Var}(X) = \operatorname{Cov}(X, X)
    $$

- **Analogy**: Imagine a group of runners on a track. Variance tells us how far apart the runners are from each other and from the average position. A high variance means the runners are spread out, while a low variance means they are clustered together.

- **Practical Meaning**: A higher variance indicates greater variability in the data, while a lower variance suggests the data points are closer to the mean.
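As a small check of the definitions above, the following sketch computes the variance of a made-up four-value population three ways: directly from the formula, as the covariance of the data with itself, and with `np.var`. The array `values` is an assumption for illustration only.

```python
import numpy as np

# Hypothetical finite population, chosen only for illustration
values = np.array([1.0, 3.0, 5.0, 7.0])

mu = values.mean()                               # population mean
var_by_definition = np.mean((values - mu) ** 2)  # (1/N) * sum of (x_i - mu)^2

# Var(X) = Cov(X, X): bias=True makes np.cov normalize by N instead of N - 1
cov_matrix = np.cov(values, values, bias=True)
var_as_covariance = cov_matrix[0, 1]

print(var_by_definition, var_as_covariance, np.var(values))
```

All three numbers agree because `np.var` uses the divisor $N$ by default, which matches the population formula.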

---

### Standard Deviation
Standard deviation is the square root of variance. It provides a measure of dispersion in the same units as the data, making it more interpretable.

- **Mathematical Definition**:  
    For a finite population of $N$ values, the standard deviation $\sigma$ is calculated as:  
    $$
    \sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2} = \frac{1}{\sqrt{N}} \sqrt{\sum_{i=1}^N (x_i - \mu)^2}
    $$

- **Analogy**: Using the same runners analogy, standard deviation tells us the typical distance of the runners from the mean position (strictly, the root-mean-square distance rather than the plain average distance), expressed in the same unit as the track (e.g., meters).

- **Practical Meaning**: A smaller standard deviation means the data points are tightly clustered around the mean, while a larger standard deviation indicates more spread.
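To illustrate the units and the runners analogy, here is a minimal sketch with made-up positions in meters. It also computes the mean absolute deviation, which is the literal "average distance from the mean", to show that it is a different (and never larger) quantity than the standard deviation.

```python
import numpy as np

# Hypothetical runner positions along the track, in meters
positions_m = np.array([10.0, 12.0, 14.0, 20.0, 24.0])

variance_m2 = np.var(positions_m)   # units: meters squared
std_m = np.sqrt(variance_m2)        # units: meters

# Mean absolute deviation: the plain average distance from the mean position
mean_abs_dev_m = np.mean(np.abs(positions_m - positions_m.mean()))

print(f"Variance:                {variance_m2:.3f} m^2")
print(f"Standard deviation:      {std_m:.3f} m")
print(f"Mean absolute deviation: {mean_abs_dev_m:.3f} m")
```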

---

### Comparison of Variance and Standard Deviation
| Aspect               | Variance $\sigma^2$                      | Standard Deviation $\sigma$              |
|----------------------|------------------------------------------|------------------------------------------|
| **Definition**       | Average of squared differences from mean | Square root of variance                  |
| **Units**            | Squared units of the data                | Same units as the data                   |
| **Interpretability** | Less intuitive due to squared units      | More intuitive and directly comparable   |
| **Relationship**     | $\sigma^2 = \sigma \cdot \sigma$         | $\sigma = \sqrt{\sigma^2}$               |

---

### Summary
Variance and standard deviation both measure how widely a dataset, or a distribution, is dispersed around its mean. Standard deviation is often preferred in practical applications because it is expressed in the same units as the data, which makes it easier to interpret.

### Examples

In [1]:
from scipy.stats import norm, expon
import numpy as np

In [2]:
def compute_discrete_variance(x):
    """Population variance from the definition: the mean squared deviation from the mean."""
    _expected_value = np.sum(x)/np.size(x)
    _variance = np.sum((x-_expected_value)**2)/np.size(x)
    return _variance

def compute_discrete_variance_v2(x):
    """Population variance via the shortcut E[X^2] - (E[X])^2."""
    _expected_value = np.sum(x)/np.size(x)
    _variance = np.sum(x**2)/np.size(x) - _expected_value**2
    return _variance

def compute_discrete_variance_v3(x):
    """Population variance from pairwise differences; the mean is never computed."""
    # Broadcast the array against itself to build the n x n matrix of pairwise differences
    diffs = x[:,None] - x
    # Keep the strictly upper triangular entries (each unordered pair once, diagonal excluded)
    upper_diag_elements = diffs[np.triu_indices(diffs.shape[0], k=1)]
    # The variance is the sum of the squared pairwise differences divided by n^2
    _variance = np.sum(upper_diag_elements**2) / x.size**2
    return _variance


In [3]:
x_norm_sz = 1000
x_norm = norm.rvs(loc=0, scale=1, size=x_norm_sz)
x_var = x_norm.var()
x_var_custom = compute_discrete_variance(x_norm)
x_custom_v2 = compute_discrete_variance_v2(x_norm)
x_custom_v3 = compute_discrete_variance_v3(x_norm)
print(f"Variance of normal distribution: {x_var}")
print(f"Custom variance implementation of normal distribution: {x_var_custom}")
print(f"Custom variance implementation v2 of normal distribution: {x_custom_v2}")
print(f"Custom variance implementation v3 of normal distribution: {x_custom_v3}")

Variance of normal distribution: 0.9749859581407636
Custom variance implementation of normal distribution: 0.9749859581407636
Custom variance implementation v2 of normal distribution: 0.9749859581407637
Custom variance implementation v3 of normal distribution: 0.9749859581407635
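The three implementations agree because they are algebraic rearrangements of the same definition. A short derivation, using $N$ for the number of values and $\mu$ for their mean as in the definition of $\sigma^2$:

$$
\frac{1}{N}\sum_{i=1}^N (x_i - \mu)^2
= \frac{1}{N}\sum_{i=1}^N x_i^2 - 2\mu \cdot \frac{1}{N}\sum_{i=1}^N x_i + \mu^2
= \frac{1}{N}\sum_{i=1}^N x_i^2 - \mu^2
$$

$$
\sum_{i=1}^N \sum_{j=1}^N (x_i - x_j)^2
= 2N \sum_{i=1}^N x_i^2 - 2\Big(\sum_{i=1}^N x_i\Big)^2
= 2N^2 \sigma^2
\quad\Longrightarrow\quad
\sigma^2 = \frac{1}{N^2} \sum_{i<j} (x_i - x_j)^2
$$

The last step uses the fact that the diagonal terms are zero and every unordered pair appears twice in the double sum. The results can still differ in the last digits because of floating-point rounding. The second form can lose precision when the mean is large relative to the spread, and the third needs memory proportional to $N^2$.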


### Properties of variance

Variance is invariant with respect to changes in a location parameter. That is, if a constant is added to all values of the variable, the variance is unchanged:

$$\operatorname{Var}(X+a)=\operatorname{Var}(X)$$



In [4]:
(x_norm + 10).var(), x_norm.var()

(np.float64(0.9749859581407635), np.float64(0.9749859581407636))

If all values are scaled by a constant, the variance is scaled by the square of that constant:
$$\operatorname{Var}(aX)=a^2\operatorname{Var}(X)$$


In [5]:
a = 2
a**2 * x_norm.var(), (x_norm*a).var()

(np.float64(3.8999438325630544), np.float64(3.8999438325630544))
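Both properties follow in one step from the definition of variance. For constants $a$ and $b$ (here $b$ plays the role of the shift), with $\mu = \operatorname{E}[X]$ so that $\operatorname{E}[aX + b] = a\mu + b$:

$$
\operatorname{Var}(aX + b)
= \operatorname{E}\left[\left(aX + b - (a\mu + b)\right)^2\right]
= \operatorname{E}\left[a^2 (X - \mu)^2\right]
= a^2 \operatorname{Var}(X)
$$

Setting $a = 1$ gives the shift invariance, and setting $b = 0$ gives the scaling rule.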

The variance of the sum (or difference) of uncorrelated random variables is the sum of their variances. Independence is sufficient for this, because independent random variables are always uncorrelated. This fact is one of the reasons for the use of the variance in preference to other measures of dispersion.

In [6]:
x_norm_2_sz = 1000
x_norm_2 = norm.rvs(loc=0, scale=2, size=x_norm_2_sz)
x_norm_2_var = x_norm_2.var()
# Sum of the two sample variances: the value Var(X + Y) should approach for independent X and Y
var_sum = x_norm_2_var + x_var
print(f"Variance of normal distribution with scale 2: {x_norm_2_var:.3f}")
print(f"Variance of standard normal distribution: {x_var:.3f}")
print(f"Sum of the two variances: {var_sum:.3f}")


Variance of normal distribution with scale 2: 4.081
Variance of standard normal distribution: 0.975
Sum of the two variances: 5.056
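The additivity property can also be checked empirically by actually adding (or subtracting) two independent samples element by element. The sketch below is an illustration under stated assumptions: it uses NumPy's `default_rng` with an arbitrary fixed seed and a sample size of 100,000, neither of which comes from the cells above.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice for repeatability
n = 100_000
x = rng.normal(loc=0.0, scale=1.0, size=n)  # Var(X) is close to 1
y = rng.normal(loc=0.0, scale=2.0, size=n)  # Var(Y) is close to 4

sum_of_vars = x.var() + y.var()
# Sample covariance with divisor n; close to 0 for independent draws
cov_xy = np.cov(x, y, bias=True)[0, 1]

print(f"Var(X) + Var(Y): {sum_of_vars:.3f}")
print(f"Var(X + Y):      {(x + y).var():.3f}")
print(f"Var(X - Y):      {(x - y).var():.3f}")
print(f"Cov(X, Y):       {cov_xy:.3f}")
```

For finite samples the match is only approximate: the variance of the sum exceeds the sum of the variances by exactly twice the sample covariance, which is small but not zero.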


In [7]:
# Weighted average of the two variances
weighted_var = (x_norm_sz * x_var + x_norm_2_sz * x_norm_2_var) / (x_norm_sz + x_norm_2_sz)
# Between-group variance term: n_X * n_Y / (n_X + n_Y)^2 * (mean_X - mean_Y)^2
between_group_variance = (x_norm_sz * x_norm_2_sz) / (x_norm_sz + x_norm_2_sz)**2 * (x_norm.mean() - x_norm_2.mean())**2
# Full analytical formula
var_union_formula = weighted_var + between_group_variance
# Empirical variance of the union (concatenation) of the two samples
empirical_union_var = np.concatenate((x_norm, x_norm_2)).var()

print(f"Between-group variance term: {between_group_variance:.3f}")
print(f"Weighted average of the two variances: {weighted_var:.3f}")
print(f"Combined variance: {var_union_formula:.3f}")
print(f"Empirical variance of the union of the two samples: {empirical_union_var:.3f}")

Between-group variance term: 0.006
Weighted average of the two variances: 2.528
Combined variance: 2.534
Empirical variance of the union of the two samples: 2.534


### Difference between the union and the sum of two random variables
Why doesn't the combined variance equal the sum of the individual variances for independent samples?

When combining two independent random variables or samples, it's important to distinguish between two different operations:

1. **Variance of the sum** ($\operatorname{Var}(X+Y)$):
   - If $X$ and $Y$ are independent, then:
     $$
     \operatorname{Var}(X+Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)
     $$
   - This formula applies when you are adding the random variables (e.g., $X+Y$), not when you are concatenating their samples.

2. **Variance of the union (concatenation) of samples**:
   - If you concatenate two samples (combine all values from $X$ and $Y$ into a single array), the variance of the combined sample is:
     $$
     \operatorname{Var}(X \cup Y) = \frac{n_X \operatorname{Var}(X) + n_Y \operatorname{Var}(Y)}{n_X + n_Y}
     + \frac{n_X n_Y}{(n_X + n_Y)^2} (\mu_X - \mu_Y)^2
     $$
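     where:
     - $n_X, n_Y$ are the sample sizes
     - $\mu_X, \mu_Y$ are the sample means
     - $\operatorname{Var}(X), \operatorname{Var}(Y)$ are the variances of each sample computed with divisor $n$ (NumPy's default, `ddof=0`)
   - The **second term** is the between-group variance, which is zero only if the means are equal.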
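A sketch of where the union formula comes from, writing $n = n_X + n_Y$ and $\mu = (n_X \mu_X + n_Y \mu_Y)/n$ for the mean of the pooled values (symbols introduced here for the derivation). Splitting the squared deviations about $\mu$ by group gives:

$$
\sum_{i=1}^{n_X} (x_i - \mu)^2 = n_X \operatorname{Var}(X) + n_X (\mu_X - \mu)^2,
\qquad
\sum_{j=1}^{n_Y} (y_j - \mu)^2 = n_Y \operatorname{Var}(Y) + n_Y (\mu_Y - \mu)^2
$$

Substituting $\mu_X - \mu = \frac{n_Y}{n}(\mu_X - \mu_Y)$ and $\mu_Y - \mu = -\frac{n_X}{n}(\mu_X - \mu_Y)$, then dividing the total by $n$:

$$
\operatorname{Var}(X \cup Y)
= \frac{n_X \operatorname{Var}(X) + n_Y \operatorname{Var}(Y)}{n}
+ \frac{n_X n_Y^2 + n_Y n_X^2}{n^3} (\mu_X - \mu_Y)^2
= \frac{n_X \operatorname{Var}(X) + n_Y \operatorname{Var}(Y)}{n}
+ \frac{n_X n_Y}{n^2} (\mu_X - \mu_Y)^2
$$

The identity is exact when every variance uses the divisor $n$ of its own sample, rather than $n - 1$.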

#### What causes the difference?
- The sum of variances applies to the sum of random variables, not to the variance of all values pooled together.
- When concatenating, if the means differ, the overall variance increases due to the spread between the groups.

Suppose $X$ and $Y$ are two samples with different means:
- $X = [1, 2, 3]$ ($\mu_X = 2$)
- $Y = [7, 8, 9]$ ($\mu_Y = 8$)

The variance of the union $[1,2,3,7,8,9]$ is about $9.67$, far larger than the average of the individual variances (about $0.67$), because the means are far apart.

#### Summary Table
| Operation                | Formula                                                                   | Condition                                                                  |
|--------------------------|---------------------------------------------------------------------------|----------------------------------------------------------------------------|
| Variance of the sum      | $\operatorname{Var}(X+Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)$ | Holds whenever $X$ and $Y$ are independent                                 |
| Variance of the union    | Weighted average of the variances plus the between-group term             | Equals the weighted average of the variances only if the means are identical |

**Key Point:**
- Adding random variables and pooling their samples are different operations, so in general $\operatorname{Var}(X + Y) \neq \operatorname{Var}(X \cup Y)$.
- When concatenating, any difference between the group means adds to the overall variance.

#### Example

In [8]:
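# Two small samples with different means
x = np.array([1, 2, 3])
y = np.array([7, 8, 9])

# Variance of the elementwise sum. Paired this way, x and y are perfectly
# correlated, so this equals var_x + var_y + 2*cov(x, y), not var_x + var_y.
var_sum = np.var(x + y)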
# Variance of the union (concatenation)
union = np.concatenate([x, y])
var_union = np.var(union)
# Individual variances
var_x = np.var(x)
var_y = np.var(y)
# Weighted average of variances
n_x, n_y = len(x), len(y)
weighted_var = (n_x * var_x + n_y * var_y) / (n_x + n_y)
# Between-group variance term
between = (n_x * n_y) / (n_x + n_y)**2 * (x.mean() - y.mean())**2
# Full formula
var_union_formula = weighted_var + between

print(f"Variance of sum (x+y): {var_sum}")
print(f"Variance of union (concatenation): {var_union}")
print(f"Weighted average of variances: {weighted_var}")
print(f"Between-group variance: {between}")
print(f"Variance of union (formula): {var_union_formula}")

Variance of sum (x+y): 2.6666666666666665
Variance of union (concatenation): 9.666666666666666
Weighted average of variances: 0.6666666666666666
Between-group variance: 9.0
Variance of union (formula): 9.666666666666666
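The variance of the sum printed above (about 2.67) is larger than the sum of the individual variances (about 1.33) because the two arrays, paired element by element, are perfectly correlated. The sketch below, offered as an illustration rather than part of the original cells, checks the general identity $\operatorname{Var}(X+Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X, Y)$ and shows that reversing `y` changes the variance of the sum but not the variance of the union. The name `y_reversed` is introduced here for the example.

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([7, 8, 9])

# Population covariance of the paired values (bias=True divides by n)
cov_xy = np.cov(x, y, bias=True)[0, 1]
var_sum_identity = np.var(x) + np.var(y) + 2 * cov_xy

# Reversing y changes the pairing (and the covariance) but not the pooled values
y_reversed = y[::-1]
var_sum_reversed = np.var(x + y_reversed)  # x + y_reversed is [10, 10, 10]
var_union_reversed = np.var(np.concatenate([x, y_reversed]))

print(f"Var(x) + Var(y) + 2 Cov(x, y): {var_sum_identity}")
print(f"Var(x + y):                    {np.var(x + y)}")
print(f"Var(x + reversed y):           {var_sum_reversed}")
print(f"Var(union with reversed y):    {var_union_reversed}")
```

The variance of a sum depends on how the values are paired, while the variance of a union depends only on the pooled set of values.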
