In [None]:
import statistics as st
import random

# Things we need to know

* mean
* median
* mode
* variance
* standard deviation
    * population standard deviation
    * sample standard deviation

# Mean, median, and mode

The _mean_ is the arithmetic average of a collection: the sum of the values divided by how many values there are. It is sometimes called mu or $\mu$.

The _median_ is the middle number of a sorted collection, or the average of the two middle numbers when the collection has an even number of items.

The _mode_ is the most frequent number in a collection. A collection can have more than one mode when several numbers tie for the highest count.
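
To make the three definitions concrete, here is a minimal sketch that computes each one by hand. The list `values` is made-up illustration data, and the variable names are chosen only for this example.

```python
from collections import Counter

values = [3, 1, 4, 1, 5, 9, 2, 6]  # made-up illustration data

# Mean: sum of the values divided by how many there are.
mean = sum(values) / len(values)

# Median: middle of the sorted values, or the average of the two middle ones.
ordered = sorted(values)
mid = len(ordered) // 2
if len(ordered) % 2 == 1:
    median = ordered[mid]
else:
    median = (ordered[mid - 1] + ordered[mid]) / 2

# Mode: the value with the highest count.
mode = Counter(values).most_common(1)[0][0]

print(mean, median, mode)
```

The `statistics` functions used in the cells below should give the same answers for this list.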

In [None]:
def repeat(n, fn):
    return [fn() for _ in range(n)]

In [None]:
nums = repeat(40, lambda: random.randint(1, 100))
print(sorted(nums))

In [None]:
st.mean(nums)

In [None]:
st.median(nums)

In [None]:
st.mode(nums)
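
With 40 random numbers, several values may tie for most frequent. The sketch below uses a small made-up list, `tied`, to show how ties are reported. It assumes Python 3.8 or later, where `st.multimode` exists and `st.mode` returns the first mode encountered; earlier versions raise `StatisticsError` from `st.mode` when there is a tie.

```python
import statistics as st

tied = [1, 1, 2, 2, 3]  # made-up data: 1 and 2 both appear twice

# multimode returns every value tied for the highest count.
print(st.multimode(tied))

# mode returns only the first of those values (Python 3.8+).
print(st.mode(tied))
```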

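For numbers drawn from a uniform distribution, the mean and median will usually be close to each other, because the distribution is symmetric. In a random sample like this one they are rarely exactly equal.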
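
A small sketch of why symmetry matters here. Both lists are made-up illustration data: `symmetric` is perfectly even, while `skewed` has one large outlier that pulls the mean away from the median.

```python
import statistics as st

symmetric = list(range(1, 101))  # every integer from 1 to 100 exactly once
skewed = [1, 2, 3, 4, 100]       # made-up data with one large outlier

# Symmetric data: mean and median coincide.
print(st.mean(symmetric), st.median(symmetric))

# Skewed data: the outlier drags the mean up, the median barely notices.
print(st.mean(skewed), st.median(skewed))
```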

# Variance and standard deviation

The standard deviation (sometimes called sigma or $\sigma$) is a measure of how spread out the numbers in a distribution are. To calculate it, we first need to know the _variance_.

The variance is the average of the squared differences from the mean.
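
Written as a formula, with $x_1, \dots, x_n$ standing for the numbers and $\mu$ for their mean (the index notation is introduced here only for illustration), that definition reads:

$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$$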

Let's take some lifespans of cats to figure this out.

In [None]:
lifespans = [19, 16, 15, 10, 17, 19, 13, 10, 17, 11]

In [None]:
mean_lifespan = st.mean(lifespans)
mean_lifespan

In [None]:
differences_from_mean = [lifespan - mean_lifespan for lifespan in lifespans]
print(differences_from_mean)

In [None]:
squared_differences = [diff * diff for diff in differences_from_mean]
print(squared_differences)

In [None]:
variance = st.mean(squared_differences)
variance

The standard deviation is the square root of the variance.

In [None]:
standard_deviation = pow(variance, 0.5)
standard_deviation

Let's see if that matches what the `statistics` module gives us.

In [None]:
st.pstdev(lifespans)

Great! But...

## Population standard deviation vs sample standard deviation

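What we have is the standard deviation in lifespan for this group of cats, treating the group as the entire population. If these cats are only a sample and we want to estimate the standard deviation in the whole population of cats, we do it a little differently. When calculating variance, instead of dividing by _n_, we divide by _n - 1_. This adjustment is known as Bessel's correction: differences measured from the sample's own mean tend to understate the true spread, and dividing by the smaller number compensates for that.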

In [None]:
sample_variance = sum(squared_differences) / (len(squared_differences) - 1)
sample_standard_deviation = pow(sample_variance, 0.5)
print(sample_standard_deviation)

In the `statistics` module, this is accounted for by the `stdev` function (sample standard deviation) vs the `pstdev` function (population standard deviation).

In [None]:
st.stdev(lifespans)
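
As a cross-check, the sketch below recomputes both variances by hand and compares them with `st.pvariance` and `st.variance`, the variance counterparts of `pstdev` and `stdev`. The `lifespans` list is repeated so the cell runs on its own; the other names are chosen just for this example.

```python
import math
import statistics as st

lifespans = [19, 16, 15, 10, 17, 19, 13, 10, 17, 11]  # same data as above

n = len(lifespans)
mean_lifespan = sum(lifespans) / n
sum_of_squares = sum((x - mean_lifespan) ** 2 for x in lifespans)

population_variance = sum_of_squares / n    # divide by n
sample_variance = sum_of_squares / (n - 1)  # divide by n - 1

# Compare with the statistics module, allowing for floating-point rounding.
print(math.isclose(population_variance, st.pvariance(lifespans)))
print(math.isclose(sample_variance, st.variance(lifespans)))
print(math.isclose(math.sqrt(sample_variance), st.stdev(lifespans)))
```

Because the sample version divides by a smaller number, it is always a little larger than the population version for the same data.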

## What does the standard deviation really mean?

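When your data is roughly normally distributed (the familiar bell curve), ~68% of it should fall within 1 standard deviation of the mean. ~95% should fall within 2 standard deviations, and ~99.7% should fall within 3 standard deviations. You may hear this called the _68-95-99.7 rule_. Data with a very different shape, or a very small sample, will not necessarily follow these percentages.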
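
These percentages describe normally distributed data. One rough way to see them is to draw random numbers from a normal distribution and count how many land within 1, 2, and 3 standard deviations of the mean. This is only a sketch: the seed, the sample size, and the helper name `fraction_within` are arbitrary choices made for this example.

```python
import random
import statistics as st

rng = random.Random(0)  # fixed seed, an arbitrary choice for repeatability
samples = [rng.gauss(0, 1) for _ in range(10_000)]

mu = st.mean(samples)
sigma = st.pstdev(samples)

def fraction_within(data, center, spread, k):
    """Return the share of data lying within k * spread of center."""
    inside = [x for x in data if abs(x - center) <= k * spread]
    return len(inside) / len(data)

for k in (1, 2, 3):
    print(k, round(fraction_within(samples, mu, sigma, k), 3))
```

The printed fractions should land close to 0.68, 0.95, and 0.997, though the exact digits depend on the random draw.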

# Further reading

* [Math is Fun - Statistics](http://www.mathsisfun.com/data/index.html)
* [Robert Niles' Statistics Guide](http://www.robertniles.com/stats/)