## Statistics

*The entire subject of statistics is based around the idea that you have a big set of data
and want to analyse it in terms of the relationships between the individual points in that
set. I am going to look at a few of the measures you can compute on a set of data, and
what they tell you about the data itself.*

### Standard Deviation

*To understand standard deviation, we need a data set. Statisticians are usually concerned with taking a sample of a population. To use election polls as an example, the population is all the people in the country, whereas a sample is a subset of the population that the statisticians measure. The great thing about statistics is that by only measuring (in this case by doing a phone survey or similar) a sample of the population, you can work out what is most likely to be the measurement if you used the entire population. In this statistics section, I am going to assume that our data sets are samples of some bigger population. There is a reference later in this section pointing to more information about samples and populations.*
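
To make the sample/population idea concrete, here is a minimal sketch that draws a sample from a synthetic population and compares the two means. The population values, the seed, and the sample size are all assumptions chosen for illustration; they are not data from this section.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical population: 100,000 synthetic measurements (illustrative values only).
population = rng.normal(loc=170.0, scale=10.0, size=100_000)

# Measure only a sample of 1,000 members, drawn without replacement.
sample = rng.choice(population, size=1_000, replace=False)

# The sample mean should land close to the population mean.
print(population.mean(), sample.mean())
```

The sample is only 1% of the population, yet its mean is typically very close to the population mean, which is the property polls rely on.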

- Here's an example set:

In [3]:
import numpy as np

In [1]:
X = [1, 2, 3, 5, 6, 8, 10, 16, 22, 26, 35, 38, 42, 54, 63, 88]

**How do we calculate it? The English definition of the SD is: “The average distance
from the mean of the data set to a point”. To calculate it, compute the square of the
distance from each data point to the mean of the set, add the squares up,
divide by $N-1$, and take the positive square root. As a formula:**

$$s=\sqrt{\frac{1}{N-1}\sum_{i=1}^N(x_i-\bar{x})^2}$$

In [4]:
np.mean(X)

26.1875

In [14]:
# np.std computes the population standard deviation by default (ddof=0);
# pass ddof=1 to compute the sample standard deviation.
np.std(X, ddof=1)

25.48782389037296
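
As a check on the formula for $s$, the same value can be computed step by step without `np.std`. This is a sketch that follows the verbal recipe (squared distances, sum, divide by $N-1$, square root); the variable names `n`, `mean`, `squared_devs` and `s_manual` are introduced here for illustration only.

```python
import math

import numpy as np

X = [1, 2, 3, 5, 6, 8, 10, 16, 22, 26, 35, 38, 42, 54, 63, 88]

n = len(X)
mean = sum(X) / n                                  # x-bar
squared_devs = [(x_i - mean) ** 2 for x_i in X]    # (x_i - x-bar)^2 for each point
s_manual = math.sqrt(sum(squared_devs) / (n - 1))  # divide by N-1, then take the root

# The manual result should agree with NumPy's sample standard deviation.
print(s_manual, np.std(X, ddof=1))
```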

*The mean doesn’t tell us a lot about the data except for a sort of
middle point. For example, these two data sets have exactly the same mean (10), but
are obviously quite different:*

In [6]:
y = [0, 8, 12, 20]
x = [8, 9, 11, 12]

In [7]:
np.mean(y) == np.mean(x)

True

In [12]:
np.std(x, ddof=1)

1.8257418583505538

In [13]:
np.std(y, ddof=1)

8.32666399786453

### Variance

Variance is another measure of the spread of data in a data set. In fact, it is almost
identical to the standard deviation: it is simply the standard deviation squared, both in its symbol
$s^2$ (written $var_{x}$ below) and in its formula, which has no square root. The formula is this:

$$var_{x}= {\frac{1}{N-1}\sum_{i=1}^N(x_i-\bar{x})^2}$$

In [22]:
np.var(x, ddof=1)

3.3333333333333335
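
The relationship between the two measures can be verified directly. The sketch below reuses the data set `x` from above; the names `s` and `var` are chosen here for illustration.

```python
import numpy as np

x = [8, 9, 11, 12]

s = np.std(x, ddof=1)    # sample standard deviation
var = np.var(x, ddof=1)  # sample variance

# The variance should equal the standard deviation squared (up to rounding).
print(np.isclose(var, s ** 2))
```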

### Covariance

*The last two measures we have looked at are purely 1-dimensional. Data sets like this could be: heights of all the people in the room, marks for the last LINALG101 exam, etc. However, many data sets have more than one dimension, and the aim of the statistical analysis of these data sets is usually to see if there is any relationship between the dimensions. For example, we might have as our data set both the height of all the students in a class, and the mark they received for that paper. We could then perform statistical analysis to see if the height of a student has any effect on their mark. Covariance is such a measure: it is always computed between two dimensions, and its formula is very similar to the formula for variance:*

$$cov_{x,y}=\frac{\sum_{i=1}^{N}(x_{i}-\bar{x})(y_{i}-\bar{y})}{N-1}$$

In [24]:
np.cov(x, y, ddof=1)

array([[ 3.33333333, 14.66666667],
       [14.66666667, 69.33333333]])
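
To see where the entries of this matrix come from, the covariance can also be computed directly from the formula. In this sketch the names `x_bar`, `y_bar`, `cov_xy` and `C` are introduced for illustration. In the 2x2 result of `np.cov`, the diagonal holds the variances of `x` and `y`, and the two off-diagonal entries both hold the covariance between them.

```python
import numpy as np

x = [8, 9, 11, 12]
y = [0, 8, 12, 20]

n = len(x)
x_bar, y_bar = np.mean(x), np.mean(y)

# Sum of products of paired deviations, divided by N-1.
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

C = np.cov(x, y, ddof=1)
# C[0, 0] = var(x), C[1, 1] = var(y), C[0, 1] = C[1, 0] = cov(x, y)
print(cov_xy, C[0, 1])
```

The positive covariance indicates that `x` and `y` tend to increase together in these two small data sets.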