# Chapters 1 to 3 Notes

* Statistics is the science of estimating population parameters from sample statistics. The word *parameter* comes from Greek and means "beyond measure". Populations of interest usually cannot be measured in full, hence the need for statistics.


* An important statistic is the *mean*, which measures the central tendency of the data. The sample mean provides a typical value and is used to estimate the population average.


* The mean alone is not an adequate summary statistic. Consider the claim of a fictitious matchbox company called *Matchless Matches* that advertises its matchboxes as having 48 matches on average. You buy six matchboxes from this company and find that they contain 48, 49, 49, 47, 48 and 47 matches. So far so good. Another matchbox company, *Mighty Matches*, also claims to make matchboxes with 48 matches on average per matchbox. Again, you buy six matchboxes from this company and find that they contain 12, 62, 3, 50, 93 and 68 matches, which also average to 48. However, the number of matches in the *Mighty Matches* boxes varies a lot. Another statistic is required to capture this type of variation.


* One way of measuring variation in data is by subtracting the lowest value from the highest value in the data. This statistic is known as the range. The range in the sample matchboxes of <i>Matchless Matches</i> was (49 - 47 =) 2 and (93 - 3 =) 90 in those of <i>Mighty Matches</i>. However, the range uses only two data points from the sample. It would be better if all the data points in the sample could be used, as they are for the mean.


* If all the matchboxes from the above examples had 48 matches, which happens to be the mean in both examples, then there would be no variation. Hence, one way of calculating variation is to find out how much each matchbox deviates from the mean and add these deviations up. This would also show the contribution of each matchbox to the total variation.
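
The range described above can be checked in a few lines. This is a minimal sketch, assuming NumPy is available; the variable names `matchless_range` and `mighty_range` are illustrative and not part of the original notes.

```python
import numpy as np

matchless = np.array([48, 49, 49, 47, 48, 47])
mighty = np.array([12, 62, 3, 50, 93, 68])

# Range = highest value minus lowest value
matchless_range = int(matchless.max() - matchless.min())
mighty_range = int(mighty.max() - mighty.min())

# np.ptp ("peak to peak") computes the same quantity in one call
assert matchless_range == np.ptp(matchless)
assert mighty_range == np.ptp(mighty)

print(matchless_range, mighty_range)
```

Whatever the sample size, only the two extreme values enter the calculation, which is why the range is a crude measure of variation.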

What follows is a calculation of mean and total deviation of <i>Matchless Matches</i> and <i>Mighty Matches</i> samples from the above examples.

In [1]:
import numpy as np

matchless = np.array([48, 49, 49, 47, 48, 47])
mighty = np.array([12, 62, 3, 50, 93, 68])

In [2]:
# save means
matchless_mean = np.mean(matchless)
mighty_mean = np.mean(mighty)

In [3]:
matchless_mean, mighty_mean

(48.0, 48.0)

Calculate the total deviation of each sample, as a first attempt at measuring variation.

In [4]:
matchless_dev = np.sum(matchless - matchless_mean)
matchless_dev

0.0

In [5]:
mighty_dev = np.sum(mighty - mighty_mean)
mighty_dev

0.0
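
The zero totals are no accident: signed deviations from the mean always sum to zero (up to floating-point rounding), because the mean is the balance point of the data. The sketch below lists each *Mighty Matches* matchbox's signed deviation; the names `deviations`, `below` and `above` are illustrative assumptions.

```python
import numpy as np

mighty = np.array([12, 62, 3, 50, 93, 68])

# Signed deviation of each matchbox from the sample mean (48)
deviations = mighty - mighty.mean()
print(deviations)

# Boxes below the mean contribute negative deviations,
# boxes above the mean contribute positive ones
below = float(deviations[deviations < 0].sum())
above = float(deviations[deviations > 0].sum())
print(below, above)  # -81.0 81.0

# The two sides cancel exactly
print(below + above)  # 0.0
```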

If the signs of the deviations from the mean are ignored, the variation in the matchbox samples becomes apparent.

In [6]:
matchless_dev_abs = np.sum(np.abs(matchless - matchless_mean))
matchless_dev_abs

4.0

In [7]:
mighty_dev_abs = np.sum(np.abs(mighty - mighty_mean))
mighty_dev_abs

162.0

* There are two issues with total deviation as calculated above. First, it depends on the sample size (<i>n</i>): as the sample grows, so does the total deviation, so totals can only be compared between samples of equal size, unlike means. Second, when signed deviations are summed, negative and positive deviations cancel each other out and mask the true magnitude of variation in the data, as happened above before the absolute values of the deviations were taken.


* The first issue raised above can be addressed by dividing the total absolute deviation by the sample size. This gives the mean deviation.


* The second issue can also be addressed by squaring the deviations from the mean before summing them, which makes every term non-negative. As with the mean deviation, the sum is then normalised, but by dividing by *n* - 1 rather than *n*, where *n* is the sample size. This gives the *variance*, a key statistic.


* The denominator *n* - 1 in the calculation of *variance* is known as the degrees of freedom.


* Since *variance* involves squaring deviations from the mean, it is expressed in squares of whatever unit is used to describe the sample. For example, if the sample data represents heights of people in metres, then *variance* is in square metres, which is hard to interpret alongside the original measurements.


* Standard deviation is the square root of the *variance*, which brings the measure back to the original units. Together with the *mean* and *variance*, standard deviation is an important statistic, "the key to the statistical lock." A population from which the sample was drawn can be summarised as *mean ± standard deviation*.
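
In symbols, the three measures described above can be written as follows. The notation is an assumption rather than something taken from the notes: $x_i$ are the $n$ sample values, $\bar{x}$ is their mean, and $s^2$ and $s$ denote the sample variance and sample standard deviation.

$$
\text{mean deviation} = \frac{1}{n}\sum_{i=1}^{n} \lvert x_i - \bar{x} \rvert,
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2,
\qquad
s = \sqrt{s^2}
$$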


In [8]:
# calculate mean absolute deviation
matchless_abs_mean_dev = np.sum(np.abs(matchless - matchless_mean)) / len(matchless)
matchless_abs_mean_dev

0.66666666666666663

In [9]:
mighty_abs_mean_dev = np.sum(np.abs(mighty - mighty_mean)) / len(mighty)
mighty_abs_mean_dev

27.0
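
To illustrate the sample-size issue, the sketch below compares the *Mighty Matches* sample with a hypothetical sample twice as large, made by counting the same six boxes twice. The helper names `total_abs_dev` and `mean_abs_dev` and the array `mighty_doubled` are assumptions introduced only for this example.

```python
import numpy as np

mighty = np.array([12, 62, 3, 50, 93, 68])
# Hypothetical larger sample: the same six boxes counted twice
mighty_doubled = np.concatenate([mighty, mighty])

def total_abs_dev(x):
    """Sum of absolute deviations from the sample mean."""
    return float(np.sum(np.abs(x - np.mean(x))))

def mean_abs_dev(x):
    """Total absolute deviation divided by the sample size."""
    return total_abs_dev(x) / len(x)

# The total doubles with the sample size...
print(total_abs_dev(mighty), total_abs_dev(mighty_doubled))  # 162.0 324.0
# ...while the mean deviation stays the same
print(mean_abs_dev(mighty), mean_abs_dev(mighty_doubled))    # 27.0 27.0
```

The spread of the data has not changed, yet the total deviation has doubled, which is why a size-normalised measure is needed for comparing samples.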

In [10]:
# calculate variance
matchless_var = np.sum((matchless - matchless_mean)**2) / (len(matchless) - 1)
matchless_var

0.80000000000000004

In [11]:
mighty_var = np.sum((mighty - mighty_mean)**2) / (len(mighty) - 1)
mighty_var

1189.2

In [12]:
# calculate standard deviations
matchless_std = np.sqrt(matchless_var)
matchless_std


0.89442719099991586

In [13]:
mighty_std = np.sqrt(mighty_var)
mighty_std

34.484779251142093

The standard deviations show that the number of matches in a *Matchless* matchbox typically deviates from the mean by about 0.89, whereas a *Mighty* matchbox deviates by about 34.48. Clearly, *Mighty* is no match for *Matchless* when it comes to consistency and customer experience.
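
As a cross-check on the hand-rolled calculations, NumPy's built-in `np.var` and `np.std` give the same results when called with `ddof=1`, which divides by *n* - 1. Their default, `ddof=0`, divides by *n* instead. A minimal sketch; the `results` dictionary is an illustrative name.

```python
import numpy as np

samples = {
    "Matchless": np.array([48, 49, 49, 47, 48, 47]),
    "Mighty": np.array([12, 62, 3, 50, 93, 68]),
}

results = {}
for name, sample in samples.items():
    # Manual sample variance with the n - 1 denominator
    var_manual = np.sum((sample - sample.mean()) ** 2) / (len(sample) - 1)
    # ddof=1 makes NumPy divide by n - 1 as well
    var_numpy = np.var(sample, ddof=1)
    std_numpy = np.std(sample, ddof=1)
    assert np.isclose(var_manual, var_numpy)
    results[name] = (float(sample.mean()), float(std_numpy))
    print(f"{name}: {sample.mean():.2f} ± {std_numpy:.2f}")

# Matchless: 48.00 ± 0.89
# Mighty: 48.00 ± 34.48
```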

## Chapter 3 spare-time activities 

1 - What is the variance of the following numbers, which total 90?
        9, 10, 13, 6, 8, 12, 13, 10, 9

In [14]:
sample = np.array([9, 10, 13, 6, 8, 12, 13, 10, 9])
samp_mean = np.mean(sample)

In [15]:
np.sum(((sample - samp_mean)**2))/(len(sample) - 1)

5.5

In [16]:
# using numpy
np.var(sample, ddof=1)

5.5

2 - Express the following set of samples as their mean ± standard deviation.
          > 1, 3, 2, 6, 4, 4, 5, 7, 6, 4, 4, 5, 3, 5, 3, 2
        
    How many standard deviations from the mean would an observation as large as 8 represent?

In [17]:
sample_2 = np.array([1, 3, 2, 6, 4, 4, 5, 7, 6, 4, 4, 5, 3, 5, 3, 2])
sample_std = np.std(sample_2, ddof=1)
sample_std

1.6329931618554521

In [18]:
sample_mean = np.mean(sample_2)
sample_mean

4.0

In [19]:
# express each number in terms of standard deviations from the mean
(sample_2 - sample_mean) / sample_std

array([-1.83711731, -0.61237244, -1.22474487,  1.22474487,  0.        ,
        0.        ,  0.61237244,  1.83711731,  1.22474487,  0.        ,
        0.        ,  0.61237244, -0.61237244,  0.61237244, -0.61237244,
       -1.22474487])

In [20]:
# find out how many standard deviations from the mean an observation as large as 8 represents
(8 - sample_mean) / sample_std

2.4494897427831779
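
The steps in this exercise can be wrapped into two small helpers. This is only a sketch: the function names `summarise` and `z_score` are hypothetical and do not come from the notes, and the standard deviation uses the same *n* - 1 denominator as above.

```python
import numpy as np

def summarise(sample):
    """Return the sample mean and the sample standard deviation (n - 1 denominator)."""
    sample = np.asarray(sample, dtype=float)
    return float(sample.mean()), float(sample.std(ddof=1))

def z_score(value, sample):
    """Number of sample standard deviations by which `value` lies from the sample mean."""
    mean, std = summarise(sample)
    return (value - mean) / std

sample_2 = [1, 3, 2, 6, 4, 4, 5, 7, 6, 4, 4, 5, 3, 5, 3, 2]

mean, std = summarise(sample_2)
print(f"{mean:.2f} ± {std:.2f}")      # 4.00 ± 1.63
print(f"{z_score(8, sample_2):.2f}")  # 2.45
```

An observation of 8 therefore lies roughly two and a half standard deviations above the sample mean.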