In [5]:
import statistics as st
import random

# Things we need to know

* mean
* median
* mode
* variance
* standard deviation
    * population standard deviation
    * sample standard deviation

# Mean, median, and mode

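The _mean_ is the average of a collection: the sum of its values divided by how many there are. It is sometimes called mu or $\mu$.

The _median_ is the middle number of the sorted collection, or the average of the two middle numbers when the collection has an even length.

The _mode_ is the most frequent number in a collection. A collection can have more than one mode when several values tie.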
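As a quick cross-check of these three definitions, the standard library's `statistics` module provides each measure. This is a minimal sketch on a small made-up list (`sample` is illustrative, not data from this notebook), and `multimode` assumes Python 3.8 or newer.

```python
import statistics as st

# Illustrative data, not taken from the notebook
sample = [2, 3, 3, 5, 7, 10]

print(st.mean(sample))       # sum divided by count
print(st.median(sample))     # average of the two middle values (even length)
print(st.multimode(sample))  # every value tied for most frequent, as a list
```

`multimode` returns a list because several values can tie for the highest count.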

In [6]:
def repeat(n, fn):
    return [fn() for _ in range(n)]

In [41]:
nums = repeat(40, lambda: random.randint(1, 100))
print(nums)

[6, 12, 6, 78, 40, 59, 1, 68, 45, 95, 26, 49, 14, 43, 8, 42, 20, 92, 66, 3, 92, 36, 21, 82, 76, 96, 53, 94, 21, 37, 67, 25, 11, 84, 93, 58, 12, 17, 83, 3]


For numbers drawn from a uniform distribution, the mean and median should be close to each other, though in a small sample like this one they will rarely be exactly equal.
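To get a feel for how sample size affects the gap between the mean and the median of uniformly drawn integers, here is a small sketch. The seed and the two sample sizes are arbitrary choices made for illustration, and the names `small` and `large` are hypothetical.

```python
import random
import statistics as st

random.seed(0)  # fixed seed so the sketch is repeatable

small = [random.randint(1, 100) for _ in range(40)]
large = [random.randint(1, 100) for _ in range(100_000)]

# Print the absolute gap between mean and median for each sample
for name, data in (("small", small), ("large", large)):
    gap = abs(st.mean(data) - st.median(data))
    print(name, round(gap, 2))
```

The gap for the large sample should be small, while the 40-element sample can drift by several units either way.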

In [10]:
def mean(coll):
    return sum(coll) / len(coll)

In [16]:
def median(coll):
    """Middle value of the sorted collection (mean of the two middle values if even)."""
    sorted_coll = sorted(coll)
    n = len(sorted_coll)
    mid = n // 2
    if n % 2 == 0:  # even: average the two middle values
        return mean(sorted_coll[mid - 1:mid + 1])
    else:  # odd: a single middle value
        return sorted_coll[mid]

In [12]:
mean([1, 2, 3, 4, 5])

3.0

In [13]:
median([1, 2, 3, 4, 5])

3

In [15]:
median([1, 2, 3, 4])

2.5

In [17]:
mean(nums)

45.975

In [18]:
median(nums)

40.0

In [31]:
from collections import Counter

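def mode(coll):
    """Return every value tied for the highest count, as a list."""
    # Counter maps each value to the number of times it appears
    counter = Counter(coll)
    max_val = max(counter.values())
    return [key for key, value in counter.items() if value == max_val]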

In [32]:
mode(nums)

[20, 22]

The _quantile_ generalizes the median. The quantile for x% is the value below which roughly x% of the collection falls, so the median is the 50% quantile. The function below takes `percent` as a fraction between 0 and 1.

In [37]:
def quantile(coll, percent):
    idx = int(percent * len(coll))
    return sorted(coll)[idx]

In [38]:
quantile(nums, 0.2)

20

In [39]:
quantile(nums, 0.9)

84

In [40]:
quantile(nums, 1.0)

IndexError: list index out of range
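The call fails because `int(1.0 * len(coll))` equals `len(coll)`, which is one past the last valid index. One possible way to guard against this is to clamp the index, as in the sketch below. The name `quantile_clamped` and the input checks are assumptions added for illustration, not part of the original notebook.

```python
def quantile_clamped(coll, percent):
    """Like quantile(), but never indexes past the end of the list."""
    if not coll:
        raise ValueError("quantile_clamped() needs at least one value")
    if not 0.0 <= percent <= 1.0:
        raise ValueError("percent must be between 0.0 and 1.0")
    # int(1.0 * len(coll)) == len(coll), one past the last index,
    # so cap the index at len(coll) - 1.
    idx = min(int(percent * len(coll)), len(coll) - 1)
    return sorted(coll)[idx]

print(quantile_clamped([1, 2, 3, 4, 5], 1.0))  # the maximum, 5
```

With the clamp in place, a `percent` of 1.0 returns the largest value instead of raising.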

# Range of data

There are several ways to see how spread out your data is. The simplest is the range: the largest value minus the smallest.

In [42]:
def data_range(coll):
    return max(coll) - min(coll)

In [43]:
data_range(nums)

95

## Variance and standard deviation

The standard deviation (sometimes called sigma or $\sigma$) is a measurement of how spread out the numbers in a distribution are, but to calculate it, we need to know the _variance_.

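The variance is the average of the squared differences from the mean. For a whole population, divide the sum of squared differences by $n$; for a sample, divide by $n - 1$ instead, which is what the `variance` function below does.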

Let's take some lifespans of cats to figure this out.

In [44]:
lifespans = [19, 16, 15, 10, 17, 19, 13, 10, 17, 11]

In [45]:
mean_lifespan = mean(lifespans)
mean_lifespan

14.7

In [46]:
data_range(lifespans)

9

In [47]:
def norm2mean(coll):
    collmean = mean(coll)
    return [element - collmean for element in coll]

In [53]:
print([round(x, 2) for x in norm2mean(lifespans)])

[4.3, 1.3, 0.3, -4.7, 2.3, 4.3, -1.7, -4.7, 2.3, -3.7]
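The differences from the mean are squared before averaging because the raw differences cancel each other out. The sketch below checks this on the same cat lifespans; it repeats the list so it can run on its own, and the name `deviations` is introduced only for illustration.

```python
lifespans = [19, 16, 15, 10, 17, 19, 13, 10, 17, 11]
mean_lifespan = sum(lifespans) / len(lifespans)
deviations = [x - mean_lifespan for x in lifespans]

# Positive and negative deviations cancel out (up to float rounding)
print(round(sum(deviations), 10))

# Squaring makes every term non-negative, so the spread is preserved
print(round(sum(d ** 2 for d in deviations), 2))
```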


In [55]:
def sum_of_squares(coll):
    return sum(x ** 2 for x in coll)

def variance(coll):
    """Measurement of distance from the mean."""
    n = len(coll)
    ncoll = norm2mean(coll)
    return sum_of_squares(ncoll) / (n - 1)

variance(lifespans)

12.233333333333334

In [56]:
import math

def stdev(coll):
    return math.sqrt(variance(coll))

In [57]:
stdev(lifespans)

3.497618237219913

The standard deviation is the square root of the variance.

Let's see if that matches what the `statistics` module gives us.

In [58]:
st.stdev(lifespans)

3.497618237219913
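The match is expected because `st.stdev` computes the sample standard deviation, dividing by n - 1. The population standard deviation divides by n instead, and the `statistics` module exposes it as `st.pstdev`. A minimal sketch comparing the two on the same data (the names `sample_sd` and `population_sd` are illustrative):

```python
import math
import statistics as st

lifespans = [19, 16, 15, 10, 17, 19, 13, 10, 17, 11]
n = len(lifespans)
m = sum(lifespans) / n
squared = sum((x - m) ** 2 for x in lifespans)

sample_sd = math.sqrt(squared / (n - 1))  # divide by n - 1
population_sd = math.sqrt(squared / n)    # divide by n

print(sample_sd, st.stdev(lifespans))
print(population_sd, st.pstdev(lifespans))
```

The population figure is always slightly smaller than the sample figure for the same data, since it divides by the larger number.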

## What does the standard deviation really mean?

For data that is roughly normally distributed, ~68% of values fall within 1 standard deviation of the mean, ~95% within 2 standard deviations, and ~99.7% within 3 standard deviations. You may hear this called the _68-95-99.7 rule_.
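These percentages describe normally distributed data. One way to check them is to draw a large normal sample and count how much of it lands within each band. The sketch below assumes a standard normal distribution, an arbitrary seed, and a hypothetical helper named `share_within`.

```python
import random
import statistics as st

random.seed(1)  # fixed seed so the sketch is repeatable
data = [random.gauss(0, 1) for _ in range(50_000)]
mu = st.mean(data)
sigma = st.pstdev(data)

def share_within(k):
    """Fraction of the sample within k standard deviations of the mean."""
    return sum(abs(x - mu) <= k * sigma for x in data) / len(data)

for k in (1, 2, 3):
    print(k, round(share_within(k), 3))
```

Data that is skewed or has heavy tails will not follow these proportions, so it is worth checking the shape of a data set before relying on them.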

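## Interquartile range

The interquartile range is the distance between the 75% and 25% quantiles, so it describes the spread of the middle half of the data and is less affected by extreme values than the full range.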

In [59]:
def interquartile_range(coll):
    return quantile(coll, 0.75) - quantile(coll, 0.25)

In [60]:
interquartile_range(lifespans)

6
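For comparison, the `statistics` module (Python 3.8 or newer) offers `st.quantiles`, which interpolates between neighbouring values rather than picking a single index. The sketch below uses it to compute an interquartile range for the same data. Its result may differ from the index-based function above, because there are several conventions for computing quartiles.

```python
import statistics as st

lifespans = [19, 16, 15, 10, 17, 19, 13, 10, 17, 11]

# n=4 asks for quartiles: three cut points that split the data in four
q1, q2, q3 = st.quantiles(lifespans, n=4)
print(q1, q2, q3)
print(q3 - q1)  # interpolated IQR; need not equal the index-based result
```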

In [61]:
import statistics as st

In [62]:
st.mean(lifespans)

14.7

# Further reading

* [Math is Fun - Statistics](http://www.mathsisfun.com/data/index.html)
* [Robert Niles' Statistics Guide](http://www.robertniles.com/stats/)