# Unit 15 - Variance

## Example: ages of friends and family

In [7]:
import numpy as np

friends = np.array([17, 19, 18, 17, 19])
np.mean(friends)

18.0

In [8]:
family = np.array([7, 38, 4, 23, 18])
np.mean(family)

18.0

Both groups have the same mean, but the ages in the group of friends are clustered close to the mean, whereas the ages of family members are all over the place. To capture the spread of the data we use the **variance**. First, we center the data by subtracting the mean from each data point; then we square these deviations and take their average. The formula is

$$\text{Var} = \frac{1}{N} \sum_i (x_i - \mu)^2$$

In [20]:
varfri = np.sum((friends - np.mean(friends))**2)/friends.size
varfri

0.80000000000000004

In [22]:
varfam = np.sum((family - np.mean(family))**2)/family.size
varfam

148.40000000000001

The variance is a measure of how far the data are spread around the mean: it is the mean squared deviation from the mean. The **standard deviation** is the square root of the variance. It has the same units as the data and gives an estimate of how far a typical data point deviates from the mean.

The variance is usually denoted with $\sigma^2$ and the standard deviation with $\sigma$.
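
As a cross-check, NumPy's built-in `np.var` and `np.std` compute the same quantities. With their default `ddof=0` they divide by $N$, matching the formula above. A minimal sketch reusing the two age arrays (the `var_*` and `std_*` names are only for illustration):

```python
import numpy as np

friends = np.array([17, 19, 18, 17, 19])
family = np.array([7, 38, 4, 23, 18])

# Population variance: ddof=0 is the default, i.e. divide by N
var_friends = np.var(friends)
var_family = np.var(family)

# Standard deviation: the square root of the variance
std_friends = np.std(friends)
std_family = np.std(family)

print(var_friends, std_friends)
print(var_family, std_family)
```

Passing `ddof=1` instead would divide by $N - 1$, which is a different estimator from the one used in this unit.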

## Exercises
Calculate mean, variance and standard deviation.

In [25]:
data = np.array([3, 4, 5, 6, 7])

# Mean
datamean = np.sum(data)/data.size
datamean

5

In [26]:
# Variance
datavar = np.sum((data - datamean)**2)/data.size
datavar

2

In [27]:
# Standard deviation
np.sqrt(datavar)

1.4142135623730951

In [28]:
data = np.array([8, 9, 10, 11, 12])

# Mean
datamean = np.sum(data)/data.size
datamean

10

In [29]:
# Variance
datavar = np.sum((data - datamean)**2)/data.size
datavar

2

In [30]:
# Standard deviation
np.sqrt(datavar)

1.4142135623730951

This data set is the first one shifted by 5: once we subtract the mean, we get the same sequence of deviations. Thus the variance and standard deviation are unchanged; only the mean differs.

In [32]:
data = np.array([15, 20, 25, 30, 35])

# Mean
datamean = np.sum(data)/data.size
datamean

25

In [34]:
# Variance
datavar = np.sum((data - datamean)**2)/data.size
datavar

50

In [35]:
# Standard deviation
np.sqrt(datavar)

7.0710678118654755

This data set is the first series multiplied by 5, so the mean and the standard deviation are 5 times those of the first example. The variance is 25 times that of the first example (25 being the square of 5).
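
The two observations so far (shifting leaves the spread unchanged, scaling by 5 multiplies the standard deviation by 5 and the variance by 25) can be checked side by side. This is a small sketch; the names `base`, `shifted` and `scaled` are introduced here for illustration only:

```python
import numpy as np

base = np.array([3, 4, 5, 6, 7])
shifted = base + 5   # same values as the second exercise: 8..12
scaled = base * 5    # same values as the third exercise: 15..35

# Print mean, variance and standard deviation for each series
for name, arr in [("base", base), ("shifted", shifted), ("scaled", scaled)]:
    print(name, np.mean(arr), np.var(arr), np.std(arr))
```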

In [36]:
data = np.array([3, 3, 3, 3, 3])

# Mean
datamean = np.sum(data)/data.size
datamean

3

In [38]:
# Variance
datavar = np.sum((data - datamean)**2)/data.size
datavar

0

When all data points are equal, the mean coincides with their common value and both the variance and the standard deviation are zero.

![](meanstdev.png)

The problem with the above formulas is that they require two passes over the data: one to get the mean, then another to get the variance.

To get the results in just one pass, we rewrite the formula for $\sigma^2$:

$$
\begin{aligned}
\sigma^2 &= \frac{1}{N} \sum_i (x_i - \mu)^2 \\
&= \frac{1}{N} \sum_i (x_i - \mu) (x_i - \mu) \\
&= \frac{1}{N} \sum_i (x_i^2 - 2\mu x_i + \mu^2) \\
&= \frac{1}{N} \sum_i x_i^2 - \frac{2\mu}{N} \sum_i x_i + \mu^2 \\
&= \frac{1}{N} \sum_i x_i^2 - 2\mu^2 + \mu^2 \\
&= \frac{1}{N} \sum_i x_i^2 - \mu^2 \\
&= \frac{1}{N} \sum_i x_i^2 - \frac{1}{N^2} \left(\sum_i x_i\right)^2
\end{aligned}
$$

In [47]:
data = np.array([3, 4, 5, 6, 7])

# Sum of elements
datasum = np.sum(data)
datasum

25

In [48]:
# Sum of squared elements
datasq = np.sum(data**2)
datasq

135

In [50]:
# Mean
datamean = datasum/data.size
datamean

5

In [52]:
# Variance
datavar = datasq/data.size - datamean**2
datavar

2
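
The cells above still call `np.sum` twice, so the data is traversed twice. To make the single pass explicit, here is a sketch that accumulates both sums in one loop; the function name `one_pass_mean_var` is hypothetical and not part of NumPy:

```python
def one_pass_mean_var(values):
    """Return (mean, variance) using a single pass over `values`."""
    n = 0
    total = 0.0      # running sum of x_i
    total_sq = 0.0   # running sum of x_i^2
    for x in values:
        n += 1
        total += x
        total_sq += x * x
    if n == 0:
        raise ValueError("need at least one data point")
    mean = total / n
    # sigma^2 = (1/N) * sum(x_i^2) - mu^2
    variance = total_sq / n - mean ** 2
    return mean, variance


mean, variance = one_pass_mean_var([3, 4, 5, 6, 7])
print(mean, variance)
```

One caveat: this formula subtracts two numbers that can be nearly equal, so in floating point it may lose precision when the mean is large compared to the spread of the data.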

## Example: raise
An employer is considering giving every employee a raise. Calculate the effect of the raise on the mean and standard deviation of the salaries, considering two scenarios: a) a fixed amount of $1,000 or b) a relative raise of 20%.

First case:

$$\mu' = \frac{1}{N} \sum_i (x_i + 1000) = \frac{1}{N} \sum_i x_i + \frac{N \times 1000}{N} = \mu + 1000$$

$$\sigma'^2 = \frac{1}{N} \sum_i \left[(x_i + 1000) - \mu'\right]^2 = \frac{1}{N} \sum_i (x_i + 1000 - \mu - 1000)^2 = \sigma^2$$

The variance is unchanged, and hence so is the standard deviation: $\sigma' = \sigma$.

Second case:

$$\mu' = \frac{1}{N} \sum_i 1.2 x_i = \frac{1.2}{N} \sum_i x_i = 1.2 \mu$$

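$$\sigma'^2 = \frac{1}{N} \sum_i (1.2 x_i - \mu')^2 = \frac{1}{N} \sum_i (1.2 x_i - 1.2 \mu)^2 = 1.44 \sigma^2$$

Taking the square root gives $\sigma' = 1.2 \sigma$: the standard deviation grows by the same 20% as the mean.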
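
Both results can be checked numerically. The salary figures below are made up purely for illustration; any set of values should show the same pattern:

```python
import numpy as np

# Hypothetical salaries, invented for this check
salaries = np.array([30000.0, 42000.0, 55000.0, 61000.0, 78000.0])

fixed = salaries + 1000    # scenario a): fixed raise
relative = salaries * 1.2  # scenario b): 20% raise

print("original:", np.mean(salaries), np.std(salaries))
print("fixed:   ", np.mean(fixed), np.std(fixed))
print("relative:", np.mean(relative), np.std(relative))
```

The fixed raise should shift the mean by 1000 and leave the standard deviation alone, while the relative raise should multiply both by 1.2.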

## Standard score

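Given a point $x$ drawn from a distribution that can be described as Gaussian, the **standard score** (or z-score) expresses how many standard deviations $x$ lies above or below the mean: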

$$ z = \frac{x - \mu}{\sigma}$$

In [55]:
x = 2

sig = np.sqrt(datavar)
z = (x - datamean)/sig
z

-2.1213203435596424

In [56]:
x = 5

z = (x - datamean)/sig
z

0.0
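
The same formula applies element-wise to a whole array. As a sketch, assuming the data set `[3, 4, 5, 6, 7]` used earlier (the name `z_scores` is introduced here), the standardised values end up with mean 0 and standard deviation 1:

```python
import numpy as np

data = np.array([3, 4, 5, 6, 7])

# Standard score of every data point at once
z_scores = (data - np.mean(data)) / np.std(data)

print(z_scores)
print(np.mean(z_scores), np.std(z_scores))
```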