In [1]:
import numpy as np

## Variance
(from [http://www.mathsisfun.com/data/standard-deviation.html](http://www.mathsisfun.com/data/standard-deviation.html))

 > The average of the squared differences from the Mean.
 
To calculate the variance, follow these steps:

 1. Work out the Mean (the simple average of the numbers)
 2. Then for each number: subtract the Mean and square the result (the squared difference).
 3. Then work out the average of those squared differences. ([Why Square?](http://www.mathsisfun.com/data/standard-deviation.html#WhySquare))

In [50]:
dogs = np.array([600, 470, 170,  430, 300])

dogs_mean = np.mean(dogs)
print("Mean", dogs_mean)

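# Steps 2 and 3: squared differences from the mean, averaged over all N values
dogs_diff = dogs - dogs_mean
dogs_variance = np.sum(dogs_diff ** 2) / len(dogs)
# The value in parentheses is NumPy's result, for comparison
print("Variance %f (%f)" % (dogs_variance, np.var(dogs)))

# The standard deviation is the square root of the variance
dogs_std = np.sqrt(dogs_variance)
print("Std %f (%f)" % (dogs_std, np.std(dogs)))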

Mean 394.0
Variance 21704.000000 (21704.000000)
Std 147.322775 (147.322775)


### But ... there is a small change with Sample Data  (degrees of freedom)
Our example has been for a Population (the 5 dogs are the only dogs we are interested in). But if the data is a Sample (a selection taken from a bigger Population), then the calculation changes!

When you have $N$ data values that are:

 - **The Population**: divide by $N$ when calculating *Variance* (as we did above; this is also what `np.var` and `np.std` do by default)
 - **A Sample**: divide by $N-1$ when calculating *Variance*
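
As a minimal sketch of the difference, the snippet below recomputes the dog example both ways. It assumes, purely for illustration, that the five values are a sample rather than the whole population. The `ddof` ("delta degrees of freedom") argument of `np.var` and `np.std` sets the divisor to $N - \text{ddof}$. The names `pop_variance`, `sample_variance`, and `sample_std` are introduced here and are not part of the original example.

```python
import numpy as np

dogs = np.array([600, 470, 170, 430, 300])

# Population variance: divide by N (NumPy's default, ddof=0)
pop_variance = np.var(dogs)

# Sample variance: divide by N - 1 (ddof=1)
sample_variance = np.var(dogs, ddof=1)
sample_std = np.std(dogs, ddof=1)

print(f"Population variance: {pop_variance:.1f}")   # 21704.0
print(f"Sample variance:     {sample_variance:.1f}")  # 27130.0
print(f"Sample std:          {sample_std:.6f}")
```

The sum of squared differences (108520) is the same in both cases; only the divisor changes from 5 to 4, so the sample variance is always the larger of the two.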

## Covariance
(from [https://www.itl.nist.gov/div898/handbook/pmc/section5/pmc541.htm](https://www.itl.nist.gov/div898/handbook/pmc/section5/pmc541.htm))

#### Sample data matrix

Consider the following matrix:

$${\bf X} = \left[ \begin{array}{ccc} 
4.0 & 2.0 & 0.60 \\
4.2 & 2.1 & 0.59 \\
3.9 & 2.0 & 0.58 \\
4.3 & 2.1 & 0.62 \\
4.1 & 2.2 & 0.63   
\end{array} \right]$$

In [55]:
X = np.array([ 
[4.0, 2.0, 0.60],
[4.2, 2.1, 0.59],
[3.9, 2.0, 0.58],
[4.3, 2.1, 0.62],
[4.1, 2.2, 0.63]])

This set of 5 observations, each measuring 3 variables, can be described by its mean vector and variance-covariance matrix. The three variables, from left to right, could be, for example, the length, width, and height of a certain object. Each row vector $X_i$ is one observation of the three variables (or components).

#### Definition of mean vector and variance-covariance matrix

The mean vector consists of the means of each variable and the variance-covariance matrix consists of the *variances* of the variables along the main diagonal and the *covariances* between each pair of variables in the other matrix positions.  

The formula for computing the sample covariance of the variables $X$ and $Y$ from $n$ observations is:

$$\mbox{COV}(X, Y) = \frac{\sum_{i=1}^n (X_i - \bar{x})(Y_i - \bar{y})}{n-1} \, ,$$

with $\bar{x}$ and $\bar{y}$ denoting the means of $X$ and $Y$, respectively.
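
As a worked example of this formula, take the first two columns of ${\bf X}$ (length and width in the illustration above). Their means are $\bar{x} = 4.10$ and $\bar{y} = 2.08$, so the deviations from the mean are $(-0.1, 0.1, -0.2, 0.2, 0.0)$ and $(-0.08, 0.02, -0.08, 0.02, 0.12)$, respectively. Plugging these in with $n = 5$:

$$\mbox{COV} = \frac{(-0.1)(-0.08) + (0.1)(0.02) + (-0.2)(-0.08) + (0.2)(0.02) + (0.0)(0.12)}{5-1} = \frac{0.030}{4} = 0.0075$$

Setting $Y = X$ in the same formula gives the sample variance of $X$, which is why the variances appear on the main diagonal.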

#### Mean vector and variance-covariance matrix for sample data matrix

In [56]:
X_mean = np.mean(X, axis=0)
print(X_mean)

[4.1   2.08  0.604]


$${\bf \bar{x}} = \left[ \begin{array}{ccc} 
4.10 & 2.08 & 0.604
\end{array} \right]$$

In [104]:
X_0 = X[:,0]
X_1 = X[:,1]
X_2 = X[:,2]

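def calc_covariance(x_1, x_2):
    """Sample covariance of two equal-length 1-D arrays (divides by n - 1)."""
    diff_1 = x_1 - np.mean(x_1)
    diff_2 = x_2 - np.mean(x_2)
    return np.sum(diff_1 * diff_2) / (len(x_1) - 1)

# "Manual": entry (i, j) is the covariance between columns i and j;
# on the diagonal (i == j) this reduces to the sample variance of column i.
cov = []
for i in range(X.shape[1]):
    row = []
    for j in range(X.shape[1]):
        row.append(calc_covariance(X[:,i], X[:,j]))
    cov.append(row)

print(np.array(cov))
print()
# NumPy equivalent; rowvar=False treats each column as a variable
print(np.cov(X, rowvar=False))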

[[0.025   0.0075  0.00175]
 [0.0075  0.007   0.00135]
 [0.00175 0.00135 0.00043]]

[[0.025   0.0075  0.00175]
 [0.0075  0.007   0.00135]
 [0.00175 0.00135 0.00043]]


$${\bf S} = \left[ \begin{array}{ccc} 
0.025 & 0.0075 & 0.00175 \\
0.0075 & 0.0070 & 0.00135 \\
0.00175 & 0.00135 & 0.00043   
\end{array} \right]$$
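
The double loop can also be written as a single matrix product. The sketch below is one possible vectorized form, not the method used by the NIST handbook: it centers each column and computes ${\bf S} = \frac{1}{n-1} {\bf X}_c^T {\bf X}_c$, where ${\bf X}_c$ is the data matrix with the mean vector subtracted from every row. The names `X_centered` and `S` are introduced here for illustration.

```python
import numpy as np

X = np.array([
    [4.0, 2.0, 0.60],
    [4.2, 2.1, 0.59],
    [3.9, 2.0, 0.58],
    [4.3, 2.1, 0.62],
    [4.1, 2.2, 0.63]])

n = X.shape[0]

# Subtract the mean vector from every row (broadcast over rows)
X_centered = X - X.mean(axis=0)

# Entry (i, j) is the sum over observations of the products of deviations
# for columns i and j, divided by n - 1 (sample covariance)
S = X_centered.T @ X_centered / (n - 1)
print(S)

# Agrees with NumPy's covariance routine
print(np.allclose(S, np.cov(X, rowvar=False)))

# The main diagonal holds the sample variances of the individual columns
print(np.allclose(np.diag(S), np.var(X, axis=0, ddof=1)))
```

Both checks should print `True`, confirming that the diagonal of the variance-covariance matrix is the per-variable sample variance.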