# Variance

## Dependency

In [2]:
import numpy as np

The **variance** $\sigma^2$ is defined as follows, where $n$ is the number of data points, $x_i$ is the $i$-th data point, and $\bar{x}$ is the mean of the data.

$$
\sigma^2 = \frac{1}{n - 1} \sum_{i = 1}^{n} (x_i - \bar{x})^2
$$

Subtracting $\bar{x}$ **mean-centers** the data. Variance measures the dispersion around the mean, so the following two datasets should have the same variance: `d2` is just `d1` shifted by 100, and the dispersion is identical. Without mean-centering, the two measurements differ.

In [12]:
d1 = np.array([1, 2, 3, 3, 2, 1])
d2 = np.array([101, 102, 103, 103, 102, 101])

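# Note: np.var divides by n by default (ddof=0).
# Pass ddof=1 to divide by n - 1 as in the formula above.
var1 = np.var(d1)
var2 = np.var(d2)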

print(f'd1 variance: {var1:.2f}')
print(f'd2 variance: {var2:.2f}')
print()


def wrong_variance(data):
    # Deliberately skips mean-centering: squares the raw values instead of
    # the deviations from the mean.
    return np.sum(np.square(data)) / (len(data) - 1)


wrong_var1 = wrong_variance(d1)
wrong_var2 = wrong_variance(d2)

print(f'd1 wrong variance: {wrong_var1:.2f}')
print(f'd2 wrong variance: {wrong_var2:.2f}')

d1 variance: 0.67
d2 variance: 0.67

d1 wrong variance: 5.60
d2 wrong variance: 12485.60


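The formula above divides by $n - 1$. Which divisor to use depends on what the data represent.

- Dividing by $n - 1$ gives the **sample** variance.
- Dividing by $n$ gives the **population** variance.

Note that `np.var` divides by $n$ unless `ddof=1` is passed, which is why the output above shows 0.67 ($4 / 6$) rather than 0.80 ($4 / 5$).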
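The two divisors can be compared directly in NumPy. The following is a minimal sketch that reuses the `d1` data from above; `population_var` and `sample_var` are names chosen here for illustration.

```python
import numpy as np

d1 = np.array([1, 2, 3, 3, 2, 1])

# ddof ("delta degrees of freedom") is subtracted from n in the divisor.
population_var = np.var(d1)        # divides by n (default, ddof=0)
sample_var = np.var(d1, ddof=1)    # divides by n - 1

print(f'population variance: {population_var:.2f}')  # 0.67
print(f'sample variance: {sample_var:.2f}')          # 0.80
```

The sum of squared deviations is 4 in both cases; only the divisor (6 versus 5) changes.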

The population **mean** is a theoretical quantity: it is fixed, because it does not depend on any particular sample. The sample mean is an empirical quantity, which differs from one sample to the next drawn from the same population.


Why divide by $n - 1$? An example with a die makes it clearer.

The population mean of a fair die is 3.5, because $1 + 2 + 3 + 4 + 5 + 6 = 21$ and $21 / 6 = 3.5$. Now suppose we roll the die 4 times, so $n = 4$, and the sample mean turns out to be 3. Given that mean, we only need to know $n - 1 = 4 - 1 = 3$ of the values to determine all 4. For example, if three of the values are [1, 2, 4], then

$$
\frac{\text{sum of values}}{4} = 3
$$
$$
\text{sum of values} = 12
$$
$$
12 - (1 + 2 + 4) = 5
$$

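This means that $n - 1$ values are free to vary, but the last one is fixed by the sample mean. With a sample size of 4, the first three values can be anything, for example [1, 2, 4], but once they are chosen the fourth is determined because the sample mean is 3. The number of values that remain free is called the **degrees of freedom**. Because the sample variance measures deviations from the sample mean $\bar{x}$ rather than from the population mean, only $n - 1$ of those deviations are free, which is why we divide by $n - 1$.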
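The die example can be checked with a few lines of NumPy. This is only a sketch of the arithmetic above; the variable names (`known_values`, `last_value`) are chosen here for illustration.

```python
import numpy as np

n = 4
sample_mean = 3
known_values = np.array([1, 2, 4])  # the n - 1 freely chosen values

# The sample mean fixes the total, so the last value is whatever is left over.
last_value = n * sample_mean - np.sum(known_values)
print(last_value)  # 5

# Equivalently, the deviations from the sample mean always sum to zero.
sample = np.append(known_values, last_value)
print(np.sum(sample - sample_mean))  # 0
```

Changing any entry of `known_values` changes `last_value`, but the deviations still sum to zero.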

## Fano Factor

The **Fano factor** $F$ is a normalized measure of variability, where $\sigma^2$ is the variance and $\mu$ is the mean. It is only used for datasets with positive values; otherwise $\mu$ could be 0, and division by zero is undefined.

$$
F = \frac{\sigma^2}{\mu}
$$
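A minimal sketch of this formula in NumPy follows. The helper `fano_factor` and the `counts` data are hypothetical, made up for illustration, and the choice of `ddof=1` (the $n - 1$ divisor) is an assumption that follows the variance formula used earlier.

```python
import numpy as np


def fano_factor(data, ddof=1):
    """Variance divided by mean; ddof=1 uses the n - 1 divisor."""
    data = np.asarray(data, dtype=float)
    mean = np.mean(data)
    if mean == 0:
        raise ValueError('Fano factor is undefined when the mean is 0')
    return np.var(data, ddof=ddof) / mean


counts = np.array([2, 4, 4, 4, 5, 5, 7, 9])  # made-up example data
fano = fano_factor(counts)
print(f'Fano factor: {fano:.2f}')  # 0.91
```

Here the mean is 5 and the sample variance is $32 / 7$, so $F = (32 / 7) / 5 \approx 0.91$.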

## Coefficient of Variation

The **coefficient of variation** $CV$ is also a normalized measure of variability, where $\sigma$ is the standard deviation and $\mu$ is the mean. It is only used for datasets with positive values; otherwise $\mu$ could be 0, and division by zero is undefined.

$$
CV = \frac{\sigma}{\mu}
$$
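The following sketch computes $CV$ with NumPy. The helper `coefficient_of_variation` and the `values` data are hypothetical, and `ddof=1` is again an assumption that matches the $n - 1$ divisor used earlier. It also shows that $CV$ does not change when the data are rescaled, because $\sigma$ and $\mu$ scale by the same factor.

```python
import numpy as np


def coefficient_of_variation(data, ddof=1):
    """Standard deviation divided by mean; ddof=1 uses the n - 1 divisor."""
    data = np.asarray(data, dtype=float)
    mean = np.mean(data)
    if mean == 0:
        raise ValueError('CV is undefined when the mean is 0')
    return np.std(data, ddof=ddof) / mean


values = np.array([2, 4, 4, 4, 5, 5, 7, 9])  # made-up example data
cv = coefficient_of_variation(values)
cv_scaled = coefficient_of_variation(values * 10)  # same data in other units

print(f'CV: {cv:.2f}')                          # 0.43
print(f'CV after scaling by 10: {cv_scaled:.2f}')  # 0.43
```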