# Two formulas for standard deviation

Ivan Valiela, in "Doing Science", points to two different ways of computing the standard deviation
as a measure of spread. Now that we have computers and spreadsheets, we usually go with the definition:

$ s = \sqrt{\frac{\sum\limits_{i=1}^{n} (x_i - \overline{x}) ^ 2}{n-1}} $

If we need to add one element, we just start from the beginning: update $\overline{x}$, recompute the sum of squared deviations, and divide it by the increased $(n-1)$.
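A minimal sketch of that from-scratch computation, using only the standard library. The name `stdev_two_pass` is made up here for illustration; it needs all the data at hand and walks over it twice.

```python
import math


def stdev_two_pass(data):
    """Sample standard deviation straight from the definition (n - 1 below the line)."""
    n = len(data)
    mean = sum(data) / n                                   # first pass: the mean
    squared_dev = sum((x - mean) ** 2 for x in data)       # second pass: squared deviations
    return math.sqrt(squared_dev / (n - 1))


print(stdev_two_pass([1, 2, 3, 4, 5]))
```

Adding a sixth sample means calling the function again on the whole list.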

However, before these happy times, we didn't want to recalculate everything from scratch.

So there was another formula:

$ s = \sqrt{\frac{\sum\limits_{i=1}^{n} (x_i^2) - \frac{(\sum\limits_{i=1}^{n} x_i)^2}{n}}{n-1}} $

(More or less: it is described there in natural language and as a textual formula with ambiguous "scopes" of operations.)

It should feature "sum of (data)^2 - \[(sum of data)^2 / number of data\]".

So, besides the count $n$, we only need to track the sum of samples and the sum of squared samples, and whenever we need the current $s$, we'd only do:

* one squaring
* two divisions
* one square root

I'd like to test this formula (and maybe debug its definition) on a few sample populations.

In [1]:
import numpy as np

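def stdev_simple(data, ddof=0):
    """Standard deviation from the sum and the sum of squares only.

    ddof=0 gives the population value (the np.std default), ddof=1 the sample value.
    """
    data = np.asarray(data)
    n = len(data)
    # (sum of squares - (sum)^2 / n) / (n - ddof)
    return np.sqrt(((data ** 2).sum() - data.sum() ** 2 / n) / (n - ddof))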


In [2]:
x = [1, 2, 3, 4, 5]
np.std(x), stdev_simple(x)

(1.4142135623730951, 1.4142135623730951)

In [3]:
x = np.random.randint(10, size=(200,))
np.std(x), stdev_simple(x)

(2.854781077420824, 2.854781077420824)

In [4]:
np.std(x, ddof=1), stdev_simple(x, ddof=1)

(2.8619449056919457, 2.8619449056919457)

## Moral of the story

This is not an approximation or a heuristic; it is just a different form of the same formula.
It is indeed computation-friendly: it keeps only two running sums (plus the sample count), which can
be updated in $O(1)$ after adding a single sample.
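The $O(1)$ update could look roughly like the sketch below. `RunningStdev` is a hypothetical helper written for this note, not something from the book or from NumPy, and the `max(..., 0.0)` guard is my own assumption to keep rounding from producing a tiny negative value under the square root.

```python
import math


class RunningStdev:
    """Keeps only n, the sum of samples and the sum of squared samples."""

    def __init__(self):
        self.n = 0
        self.total = 0.0       # running sum of x_i
        self.total_sq = 0.0    # running sum of x_i ** 2

    def add(self, x):
        # O(1) update: no stored samples, no recomputed mean
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def stdev(self, ddof=1):
        # shortcut formula: (sum of squares - (sum)^2 / n) / (n - ddof)
        numerator = self.total_sq - self.total ** 2 / self.n
        return math.sqrt(max(numerator, 0.0) / (self.n - ddof))


acc = RunningStdev()
for sample in [1, 2, 3, 4, 5]:
    acc.add(sample)
print(acc.stdev(ddof=0), acc.stdev(ddof=1))
```

After each `add` the current value is available without touching the earlier samples again.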

# Derivation of the formula

I had some trouble transforming the definition into the formula "without the mean under the sum", but it is quite easy to find on the Internet.

Some call it the shortcut formula for variance:

https://www.saddleback.edu/faculty/pquigley/math10/shortcut.pdf

or alternate variance formulas:

https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data/variance-standard-deviation-population/v/statistics-alternate-variance-formulas 

(for population, with ddof=0, we get: σ² = ( (Σ x²) / N ) - μ²) 

shortcut variance formula: https://www.youtube.com/watch?v=9_NFIpsFkoo

But instead of mouse-on-screen handwriting, let's do it in proper LaTeX:

$ \sigma^2 = \frac{\sum\limits_{i=1}^{n}(x_i - \overline{x})^2}{n - 1} $

expand the square of the difference:

$ \sigma^2 = \frac{\sum\limits_{i=1}^{n}(x_i^2 \; - \; 2 \overline{x}x_i \; + \; \overline{x}^2)}{n - 1} $

split the sums:

$ \sigma^2 = \frac{
  \sum\limits_{i=1}^{n}(x_i^2) \; 
  - \; 2 \overline{x}\sum\limits_{i=1}^{n} x_i \; 
  + \; \overline{x}^2\sum\limits_{i=1}^{n} 1
}
{n - 1} $

now it's high time to substitute $\overline{x}$ with $\frac{\sum\limits_{i=1}^{n} x_i}{n}$,
as well as $\sum\limits_{i=1}^{n}1$ with $n$.

$ \sigma^2 = \frac{
  \sum\limits_{i=1}^{n}(x_i^2) \; 
  - \; 2 \frac{\sum\limits_{i=1}^{n} x_i}{n} \cdot \sum\limits_{i=1}^{n} x_i \; 
  + \; (\frac{\sum\limits_{i=1}^{n} x_i}{n})^2 \cdot n
}
{n - 1} $

let's substitute $ S_x = \sum\limits_{i=1}^{n} x_i $ for clarity:

$ \sigma^2 = \frac{
  \sum\limits_{i=1}^{n}(x_i^2) \; 
  - \; 2 \frac{S_x}{n} \cdot S_x \; 
  + \; (\frac{S_x}{n})^2 \cdot n
}{n - 1} $

the last two terms actually both contain $ \frac{S_x^2}{n} $:

$ \sigma^2 = \frac{
  \sum\limits_{i=1}^{n}(x_i^2) \; + \; \frac{S_x^2}{n} (-2 + 1)
}{n - 1} $

so finally:

$ \sigma^2 = \frac{\sum\limits_{i=1}^{n}(x_i^2) \; - \; \frac{S_x^2}{n}}{n - 1}$

expanding $S_x$ back to $\sum\limits_{i=1}^{n} x_i$:

$ \sigma^2 = \frac{\sum\limits_{i=1}^{n}(x_i^2) \; - \; \frac{(\sum\limits_{i=1}^{n} x_i)^2}{n}}{n - 1}$
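
As a quick sanity check of the final formula, take the data $1, 2, 3, 4, 5$ from the cells above, where $n = 5$, $\sum x_i = 15$ and $\sum x_i^2 = 55$:

$ \sigma^2 = \frac{55 - \frac{15^2}{5}}{5 - 1} = \frac{55 - 45}{4} = 2.5 $

The definition gives the same value: $\overline{x} = 3$, the squared deviations are $4, 1, 0, 1, 4$, their sum is $10$, and $\frac{10}{4} = 2.5$.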


## For population variance

$\sigma_p^2 = \frac{\sum\limits_{i=1}^{n}(x_i^2) \; - \; \frac{(\sum\limits_{i=1}^{n} x_i)^2}{n}}{n}$

We can do the division:

$\sigma_p^2 = \sum\limits_{i=1}^{n}(x_i^2) / n \; - \; \frac{(\sum\limits_{i=1}^{n} x_i)^2}{n^2}$

which is:

$\sigma_p^2 = \frac{\sum\limits_{i=1}^{n}(x_i^2)}{n} \; - \; (\frac{\sum\limits_{i=1}^{n} x_i}{n})^2$

and $\frac{\sum\limits_{i=1}^{n} x_i}{n} = \overline{x}$, so:

$\sigma_p^2 = \frac{\sum\limits_{i=1}^{n}(x_i^2)}{n} \; - \; \overline{x}^2$

Which produces a nice punchline "average of the squares minus square of the average".
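
The punchline translates almost word for word into code. This is a small sketch with a made-up function name (`pvariance_shortcut`) and made-up data, compared against `statistics.pvariance` from the standard library:

```python
import statistics


def pvariance_shortcut(data):
    """Population variance as 'average of the squares minus square of the average'."""
    n = len(data)
    mean_of_squares = sum(x * x for x in data) / n
    square_of_mean = (sum(data) / n) ** 2
    return mean_of_squares - square_of_mean


data = [2, 4, 4, 4, 5, 5, 7, 9]
print(pvariance_shortcut(data), statistics.pvariance(data))
```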

# Sample covariance

The formula is actually very similar to the variance, and if $ x = y $ then $ \sigma_{xy} = \sigma_x^2 $

$ \sigma_{xy} = \frac{\sum\limits_{i=1}^n(x_i - \overline{x})(y_i - \overline{y}) }{n-1} $

This can be expanded to:

$ \sigma_{xy} = \frac{\sum\limits_{i=1}^n(x_i\cdot y_i - \overline{x}y_i - x_i\overline{y} + \overline{x}\overline{y}) }{n-1} $


When we split the sums, we get 4 sums:

$ \sigma_{xy} = \frac{
    \sum\limits_{i=1}^n x_i\cdot y_i - 
    \overline{x}\sum\limits_{i=1}^n y_i - 
    \overline{y}\sum\limits_{i=1}^n x_i + 
    \overline{x}\cdot\overline{y}\sum\limits_{i=1}^n 1 }{n-1} $

To make it more readable, let's substitute $S_x = \sum\limits_{i=1}^n x_i$ and $S_y = \sum\limits_{i=1}^n y_i$

$ \sigma_{xy} = \frac{
    \sum\limits_{i=1}^n x_i\cdot y_i - 
    \overline{x}S_y - 
    \overline{y}S_x + 
    \overline{x}\cdot\overline{y}\cdot n}{n-1} $
    
substituting $\overline{x} = \frac{S_x}{n}$ and $\overline{y} = \frac{S_y}{n}$, we have:

$ \sigma_{xy} = \frac{
    \sum\limits_{i=1}^n x_i\cdot y_i - 
    \frac{S_x}{n}S_y - 
    \frac{S_y}{n}S_x + 
    \frac{S_x}{n}\cdot\frac{S_y}{n}\cdot n}{n-1} $

the last three terms all contain $\frac{S_x \cdot S_y}{n}$:

$ \sigma_{xy} = \frac{
    \sum\limits_{i=1}^n x_i\cdot y_i + 
    \frac{S_x \cdot S_y}{n}(-1 - 1 + 1)}{n-1} $
    
Designating the sum of products as $S_{xy} = \sum\limits_{i=1}^n x_i\cdot y_i$, this simplifies to:

$ \sigma_{xy} = \frac{S_{xy} - \frac{S_x S_y}{n}}{n-1} $

or with sums:

$ \sigma_{xy} = \frac{\sum\limits_{i=1}^n x_i\cdot y_i - \frac{\sum\limits_{i=1}^n x_i \cdot \sum\limits_{i=1}^n y_i}{n}}{n-1}$
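
To test the covariance shortcut the same way as the variance one, here is a sketch that computes the sample covariance both from the definition and from the three sums $S_x$, $S_y$, $S_{xy}$. The function names and the data are assumptions made up for illustration.

```python
def cov_definition(xs, ys):
    """Sample covariance from the definition (means under the sum)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def cov_shortcut(xs, ys):
    """Sample covariance from S_x, S_y and S_xy only."""
    n = len(xs)
    s_x = sum(xs)
    s_y = sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys))
    return (s_xy - s_x * s_y / n) / (n - 1)


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(cov_definition(xs, ys), cov_shortcut(xs, ys))
print(cov_shortcut(xs, xs))  # with x = y this is the sample variance of x
```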

In short, for the population covariance (dividing by $n$ instead of $n-1$):

$ \sigma_{xy}^{(p)} = 
\frac{\sum\limits_{i=1}^n x_i\cdot y_i - \frac{\sum\limits_{i=1}^n x_i \cdot \sum\limits_{i=1}^n y_i}{n}}{n} = 
\frac{\sum\limits_{i=1}^n x_i\cdot y_i}{n} 
  - \frac{\sum\limits_{i=1}^n x_i}{n} \cdot \frac{\sum\limits_{i=1}^n y_i}{n} =
\overline{x \cdot y} - \overline{x} \cdot \overline{y}$

In words: "average of the product minus product of the averages". 

(Substitute "expectation" for "average" where applicable)
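
The same phrase can be checked in a few lines. This is a sketch with hypothetical function names and made-up data, comparing the "average of the product minus product of the averages" form with the population definition:

```python
def pcov_definition(xs, ys):
    """Population covariance from the definition (divide by n)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def pcov_shortcut(xs, ys):
    """Average of the product minus product of the averages."""
    n = len(xs)
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    return mean_xy - (sum(xs) / n) * (sum(ys) / n)


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(pcov_definition(xs, ys), pcov_shortcut(xs, ys))
```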