# Two formulas for standard deviation

In "Doing Science", Ivan Valiela points to two different ways of computing the standard deviation
as a measure of spread. Now that we have computers and spreadsheets, we usually go with the definition:

$ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \overline{x})^2}{n-1}} $

If we need to add one element, we start over from the beginning: update $\overline{x}$, recompute the sum of squared deviations from it, and divide by the increased $(n-1)$.

However, before these happy times, nobody wanted to redo that whole calculation by hand.

So there was another formula:

$ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i^2) - \frac{(\sum_{i=1}^{n} x_i)^2}{n}}{n-1}} $

(More or less: it is described in natural language and via a textual formula with ambiguous "scopes" of operations.)

It should feature "sum of (data)^2 - \[(sum of data)^2 / number of data\]".

So we only need to track the number of samples, their sum, and the sum of their squares. Whenever we need the current $s$, we only do:

* one squaring
* two divisions
* one square root
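
As a minimal sketch of that bookkeeping (the class name `RunningStdev` and its attributes are hypothetical, made up for illustration), the state can be held in three numbers and updated one sample at a time. The `ddof` argument is assumed to mean the same as in `np.std`: the divisor is `n - ddof`.

```python
import math


class RunningStdev:
    """Hypothetical helper: stores only n, sum(x) and sum(x^2)."""

    def __init__(self):
        self.n = 0           # number of samples seen so far
        self.total = 0.0     # running sum of the samples
        self.total_sq = 0.0  # running sum of the squared samples

    def add(self, x):
        # Constant-time update: no need to revisit earlier samples.
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def stdev(self, ddof=1):
        # Sum of squares minus (square of the sum) / n.
        sum_sq_dev = self.total_sq - self.total ** 2 / self.n
        # Rounding can push the difference slightly below zero, so clamp it.
        return math.sqrt(max(sum_sq_dev, 0.0) / (self.n - ddof))


rs = RunningStdev()
for value in [1, 2, 3, 4, 5]:
    rs.add(value)
print(rs.stdev(ddof=0), rs.stdev(ddof=1))
```

Calling `stdev()` at any point costs the same handful of operations listed above, regardless of how many samples have been added.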

I'd like to test this formula (and maybe debug its definition) on a few sample populations.

In [10]:
import numpy as np

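def stdev_simple(data, ddof=0):
    """Standard deviation from the sum and the sum of squares only.

    ddof=0 (the default, as in np.std) divides by n; ddof=1 divides by n-1.
    """
    data = np.asarray(data)
    n = len(data)
    # sum of squares minus (square of the sum) / n
    sum_sq_dev = (data ** 2).sum() - data.sum() ** 2 / n
    return np.sqrt(sum_sq_dev / (n - ddof))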


In [11]:
x = [1, 2, 3, 4, 5]
np.std(x), stdev_simple(x)

(1.4142135623730951, 1.4142135623730951)

In [12]:
x = np.random.randint(10, size=(200,))
np.std(x), stdev_simple(x)

(2.8117565684105728, 2.8117565684105728)

In [13]:
np.std(x, ddof=1), stdev_simple(x, ddof=1)

(2.8188124303663664, 2.818812430366366)

## Moral of the story

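This is not an approximation or a heuristic; it is the same quantity in a different algebraic form
(the results above agree up to floating-point rounding in the last digit).
It is indeed computation-friendly: it keeps only two running sums plus the sample count,
each of which can be updated in $O(1)$ after adding a single sample.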
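
To see why the two forms must agree, one can expand the squared deviations and use $\sum x_i = n\overline{x}$. This is only a short derivation sketch, with all sums running over $i = 1, \dots, n$:

$ \sum (x_i - \overline{x})^2 = \sum x_i^2 - 2\overline{x}\sum x_i + n\overline{x}^2 = \sum x_i^2 - n\overline{x}^2 = \sum x_i^2 - \frac{(\sum x_i)^2}{n} $

Dividing both sides by $(n-1)$ and taking the square root turns the first formula for $s$ into the second.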