In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 

### Covariance

In Characterising1D we discussed variance (the average squared deviation from the mean). Covariance is, as you may have guessed, similar. Suppose we have a data vector $x^a$ with $N$ points, so $x_i^a$ is the $i$-th element of that vector. From the previous discussion we have

$$ Var^{a,a} = \frac{1}{N-1} \sum_{i=1}^N (x_i^a - \mu^a)(x_i^a - \mu^a), $$

Another way of stating this is that it is the covariance of the vector $x^a$ with itself. Notice there are two sets of brackets, and both use the data vector $x^a$. Covariance is what you get when you change one of the letters:

$$ Var^{a,b} = \frac{1}{N-1} \sum_{i=1}^N (x_i^a - \mu^a)(x_i^b - \mu^b), $$
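The formula above can be written out by hand as a quick check. This is a minimal sketch on made-up toy vectors; the names `x_a`, `x_b` and `sample_covariance` are illustrative and not part of the dataset used later.

```python
import numpy as np

# Hypothetical toy data: two vectors with N = 5 points each.
x_a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x_b = np.array([2.0, 4.0, 5.0, 4.0, 5.0])


def sample_covariance(u, v):
    """Sum the products of deviations from each mean, then divide by N - 1."""
    n = len(u)
    return np.sum((u - u.mean()) * (v - v.mean())) / (n - 1)


var_aa = sample_covariance(x_a, x_a)  # covariance of x_a with itself = variance
var_ab = sample_covariance(x_a, x_b)  # swap one letter to get the covariance

print(var_aa, var_ab)
```

Passing the same vector twice reproduces the variance, which is exactly the "change one of the letters" idea.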

Easy! All we have done is make one set of brackets iterate over a different data vector. The goal is to do this for every pair of vectors you have and collect the results in a matrix. If we had only two vectors, our matrix would be:

$$ Cov = \begin{pmatrix} Var^{a,a} & Var^{a,b} \\ Var^{b,a} & Var^{b,b} \\ \end{pmatrix} $$

Notice that this is symmetric: $Var^{a,b} = Var^{b,a}$, and the diagonal entries are just the variance of each data vector. The off-diagonal entries measure the joint spread of the two.
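Both properties can be checked numerically. The sketch below uses randomly generated vectors (an assumption for illustration, not the height/weight data) and compares the diagonal of the matrix with the sample variance of each vector.

```python
import numpy as np

# Hypothetical data: x_b is partly driven by x_a, so the two co-vary.
rng = np.random.default_rng(0)
x_a = rng.normal(size=200)
x_b = 0.5 * x_a + rng.normal(size=200)

# Each 1D argument is treated as one variable, giving a 2x2 matrix.
cov = np.cov(x_a, x_b)

# Symmetry: Var^{a,b} equals Var^{b,a}.
is_symmetric = np.allclose(cov, cov.T)

# Diagonal: the sample variance (ddof=1, i.e. the N - 1 denominator) of each vector.
diag_matches = np.allclose(np.diag(cov), [x_a.var(ddof=1), x_b.var(ddof=1)])

print(is_symmetric, diag_matches)
```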

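We can calculate covariance using `np.cov` or `pd.DataFrame.cov`. Note that `np.cov` treats each row as a variable by default, so we pass `rowvar=False` because our variables are in columns.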

In [2]:
dataset = pd.read_csv("../data/characterising/height_weight.csv")[["height", "weight"]]
dataset.head()

   height  weight
0   71.74  259.88
1   71.00  186.73
2   63.83  172.17
3   67.74  174.66
4   67.28  169.20


In [4]:
covariance = np.cov(dataset, rowvar=False)
print(covariance)

[[  18.60200779   78.50218098]
 [  78.50218098 1512.91208783]]


In [5]:
covariance = dataset.cov()
print(covariance)

           height       weight
height  18.602008    78.502181
weight  78.502181  1512.912088
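The two calls agree because both use the $N-1$ denominator and both end up treating columns as variables. The sketch below illustrates this on a small made-up frame (`toy` is hypothetical, not the CSV above), and shows what happens if `rowvar=False` is forgotten.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 4 observations (rows) of 2 variables (columns).
toy = pd.DataFrame(
    {
        "height": [60.0, 65.0, 70.0, 75.0],
        "weight": [120.0, 150.0, 160.0, 200.0],
    }
)
values = toy.to_numpy()

cov_np = np.cov(values, rowvar=False)  # columns are variables
cov_np_t = np.cov(values.T)            # equivalent: transpose so rows are variables
cov_pd = toy.cov()                     # pandas always treats columns as variables

print(np.allclose(cov_np, cov_pd.to_numpy()), np.allclose(cov_np, cov_np_t))

# Without rowvar=False, each of the 4 rows is treated as a variable,
# so the result is 4x4 rather than the intended 2x2.
wrong_shape = np.cov(values).shape
print(cov_np.shape, wrong_shape)
```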


### Correlation

Correlation and covariance are closely linked. If we take the 2D covariance matrix from above, which is written in terms of variance, we can rewrite it in terms of the standard deviation $\sigma$, since $Var = \sigma^2$.

$$ Cov = \begin{pmatrix} \sigma^2_{a,a} & \sigma^2_{a,b} \\ \sigma^2_{b,a} & \sigma^2_{b,b} \\ \end{pmatrix} $$

And here is the correlation matrix:

$$ Corr = \begin{pmatrix} \sigma^2_{a,a}/\sigma^2_{a,a} & \sigma^2_{a,b}/(\sigma_{a,a}\sigma_{b,b}) \\ \sigma^2_{b,a}/(\sigma_{a,a}\sigma_{b,b}) & \sigma^2_{b,b}/\sigma^2_{b,b} \\ \end{pmatrix} $$

which is the same as 

$$ Corr = \begin{pmatrix} 1 & \rho_{a,b} \\ \rho_{b,a} & 1 \\ \end{pmatrix}, $$

where $\rho_{a,b} = \sigma^2_{a,b}/(\sigma_{a,a}\sigma_{b,b})$. Another way to think about this is that 

$$ Corr_{a,b} = \frac{Cov_{a,b}}{\sigma_a \sigma_b} $$

It is the joint variability normalised by the variability of each individual variable.
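This normalisation can be carried out directly on a covariance matrix. The following is a sketch on synthetic data (the vectors and their parameters are assumptions chosen only for illustration): divide each entry $Cov_{a,b}$ by $\sigma_a \sigma_b$, where the standard deviations come from the square root of the diagonal.

```python
import numpy as np

# Hypothetical data: x_b depends linearly on x_a, plus noise.
rng = np.random.default_rng(1)
x_a = rng.normal(loc=66.0, scale=4.0, size=500)
x_b = 2.5 * x_a + rng.normal(scale=30.0, size=500)

cov = np.cov(x_a, x_b)

# Standard deviation of each variable is the square root of its variance.
sigma = np.sqrt(np.diag(cov))

# Entry (a, b) of the outer product is sigma_a * sigma_b.
corr = cov / np.outer(sigma, sigma)

print(corr)
print(np.allclose(corr, np.corrcoef(x_a, x_b)))
```

The diagonal comes out as 1 because each variance is divided by itself, matching the matrix form above.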

We can calculate the correlation matrix using `np.corrcoef` or `pd.DataFrame.corr`.

In [8]:
corr = np.corrcoef(dataset.T)
print(corr)

[[1.         0.46794517]
 [0.46794517 1.        ]]


In [9]:
corr = dataset.corr()
print(corr)

          height    weight
height  1.000000  0.467945
weight  0.467945  1.000000


So what does this mean? In simple terms, the positive height-weight correlation means that, on average, a taller person weighs more than a shorter person, and a shorter person weighs less than a taller one.

If the number were negative, it would mean the opposite. Some examples:

* **Age vs number of pregnancies**: Positive correlation
* **Temperature in Celsius vs Temperature in Kelvin**: Total positive correlation ($1.0$)
* **Number of cigarettes smoked vs Life expectancy**: Negative correlation
* **Height vs Comfort on plane seats**: Negative correlation
* **Number of units purchased vs Cost of individual unit**: Hopefully a negative correlation!

Take two things and ask yourself: if one goes up, do I expect the other to go up, go down, or not change?

This is correlation.
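Two of the examples in the list can be checked with a few made-up numbers. In this sketch the temperature readings and the price/units figures are invented purely for illustration. Kelvin is just Celsius shifted by a constant, so the correlation should be 1, while units sold falling as price rises should give a negative value.

```python
import numpy as np

# Hypothetical temperature readings; Kelvin = Celsius + 273.15.
celsius = np.array([-10.0, 0.0, 12.5, 25.0, 37.0])
kelvin = celsius + 273.15
rho_temp = np.corrcoef(celsius, kelvin)[0, 1]

# Hypothetical price list: fewer units are bought as the unit cost rises.
price = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
units = np.array([95.0, 80.0, 70.0, 40.0, 30.0])
rho_price = np.corrcoef(price, units)[0, 1]

print(rho_temp, rho_price)
```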