# Quantifying scatter

In [1]:
import numpy as np

data = np.array([64.7, 65.4, 88.3, 64, 71.9])
# np.concatenate((array_1, array_2, array_3))  # the arrays must be passed as a single sequence

In [3]:
print(f"The mean of the dataset is {np.mean(data):.2f} and its standard deviation {np.std(data, ddof=1):.2f}")
# np.std() defaults to ddof=0 (population SD); ddof=1 gives the sample SD

The mean of the dataset is 70.86 and its standard deviation 10.25


The standard deviation (SD) quantifies the variation among the values and estimates the spread of the underlying distribution:
$$\text{SD}=\sqrt{\frac{\sum(Y_i - \bar{Y})^2}{n-1}}$$
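As a quick check of the formula above, the SD can be computed step by step and compared with `np.std(data, ddof=1)`. This is only a sketch on the same five values; the names `squared_deviations` and `sd_manual` are introduced here for illustration.

```python
import numpy as np

data = np.array([64.7, 65.4, 88.3, 64, 71.9])

# Sum of squared deviations from the sample mean
squared_deviations = (data - data.mean()) ** 2

# Divide by n - 1 (sample SD), then take the square root
sd_manual = np.sqrt(squared_deviations.sum() / (len(data) - 1))

print(f"{sd_manual:.2f}")             # hand-computed SD: 10.25
print(f"{np.std(data, ddof=1):.2f}")  # same value from NumPy: 10.25
```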

We can interpret the SD with the 3-$\sigma$ rule of thumb (also called the 68-95-99.7 rule): in a roughly Gaussian population, about 2/3 of the observations lie within the range from the mean minus 1 SD to the mean plus 1 SD. This gives the following intervals around the mean:
* \[-1 SD; 1 SD\] ≈ 68%
* \[-2 SD; 2 SD\] ≈ 95%
* \[-3 SD; 3 SD\] ≈ 99.7%
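The percentages in this rule of thumb can be recovered from the cumulative distribution function of a standard Gaussian. The sketch below assumes an exactly normal population, which real data only approximate; the dictionary `coverage` is a name introduced for illustration.

```python
from scipy import stats

# Fraction of a standard normal distribution lying within k SD of the mean
coverage = {k: stats.norm.cdf(k) - stats.norm.cdf(-k) for k in (1, 2, 3)}

for k, fraction in coverage.items():
    print(f"mean ± {k} SD: {fraction:.1%}")
# mean ± 1 SD: 68.3%
# mean ± 2 SD: 95.4%
# mean ± 3 SD: 99.7%
```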

In [4]:
np.percentile(data, q=[25, 75])

array([64.7, 71.9])

In [18]:
np.var(data, ddof=1)

105.01299999999995

## More statistics using scipy.stats

In [10]:
from scipy import stats

### IQR

In [11]:
stats.iqr(data)

7.200000000000003
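The IQR is simply the distance between the 75th and 25th percentiles computed earlier. A minimal sketch, assuming the default linear interpolation of `np.percentile` (the name `iqr_manual` is illustrative):

```python
import numpy as np
from scipy import stats

data = np.array([64.7, 65.4, 88.3, 64, 71.9])

# 25th and 75th percentiles (first and third quartiles)
q1, q3 = np.percentile(data, [25, 75])

# The interquartile range is the width of the middle 50% of the data
iqr_manual = q3 - q1

print(np.isclose(iqr_manual, stats.iqr(data)))  # True
```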

### Descriptive summary

In [8]:
stats.describe(data) # ->nobs, minmax, mean, variance, skewness, kurtosis

DescribeResult(nobs=5, minmax=(64.0, 88.3), mean=70.86000000000001, variance=105.01299999999995, skewness=1.1912013156755001, kurtosis=-0.24972334599204293)

### SEM

The standard error of the mean (SEM) quantifies how precisely we know the population mean. The SEM computed from one sample is the best estimate of what the SD among sample means would be if we collected an infinite number of samples of the same size (think about the margin of error of a confidence interval, W = t * SEM).

$$\text{SEM}=\frac{\text{SD}}{\sqrt{n}}$$

In [9]:
stats.sem(data) # standard error of the mean, i.e. SEM = SD / sqrt(n)

4.582859369433017
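The following sketch first checks the formula against `stats.sem`, then illustrates the interpretation by simulation: the SD among many sample means should be close to the population SD divided by $\sqrt{n}$. The simulated population (Gaussian, mean 70, SD 10), the seed and the number of repetitions are arbitrary assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

data = np.array([64.7, 65.4, 88.3, 64, 71.9])

# SEM from the formula: sample SD (ddof=1) divided by sqrt(n)
sem_manual = np.std(data, ddof=1) / np.sqrt(len(data))
print(np.isclose(sem_manual, stats.sem(data)))  # True

# Simulation with a hypothetical Gaussian population (mean 70, SD 10)
rng = np.random.default_rng(seed=0)
n = 5
sample_means = rng.normal(loc=70, scale=10, size=(20_000, n)).mean(axis=1)

sd_of_means = sample_means.std(ddof=1)  # scatter among the sample means
theoretical_sem = 10 / np.sqrt(n)       # population SD / sqrt(n)
print(f"{sd_of_means:.2f} vs {theoretical_sem:.2f}")
```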

### Z-score

The Z-score tells how many SDs a value lies from the mean; a common rule of thumb treats a data point with |Z| > 3 as an outlier.

In [12]:
stats.zscore(data, ddof=0)

array([-0.6720695 , -0.59569796,  1.90274222, -0.74844103,  0.11346628])
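For reference, the same Z-scores can be obtained by hand: subtract the mean and divide by the SD. The sketch uses `ddof=0` to match the call above; `z_manual` is an illustrative name.

```python
import numpy as np
from scipy import stats

data = np.array([64.7, 65.4, 88.3, 64, 71.9])

# Distance of each value from the mean, expressed in SD units (ddof=0)
z_manual = (data - data.mean()) / data.std(ddof=0)

print(np.allclose(z_manual, stats.zscore(data, ddof=0)))  # True
```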

### Coefficient of variation (CV)

The CV is a relative measure of variability: it equals the SD divided by the mean, so CV = 0.25 means that the SD is 25% of the mean.

In [13]:
stats.variation(data) # uses the population SD (ddof=0) by default

0.1293496858434382
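The value above corresponds to the population SD (`ddof=0`) divided by the mean. The sketch below reproduces it by hand and also shows the variant based on the sample SD (`ddof=1`), which is slightly larger; `cv_population` and `cv_sample` are illustrative names.

```python
import numpy as np
from scipy import stats

data = np.array([64.7, 65.4, 88.3, 64, 71.9])

# CV with the population SD (ddof=0), the default of stats.variation
cv_population = data.std(ddof=0) / data.mean()

# CV with the sample SD (ddof=1), consistent with the SD reported earlier
cv_sample = data.std(ddof=1) / data.mean()

print(np.isclose(cv_population, stats.variation(data)))  # True
print(f"{cv_population:.3f} (ddof=0) vs {cv_sample:.3f} (ddof=1)")
```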

### Geometric mean and standard deviation

In [16]:
# `array` is the list defined in In [15] (see "Weighted statistics" below)
stats.gmean(array) # geometric mean, equivalent to the next line
10**np.mean(np.log10(array))

85.57807971033304

In [19]:
10**np.std(np.log10(array)) # gives the GSD (no unit)

1.1038273439921256

The result is reported as GM ×/ GSD ("times or divided by"), i.e. about 2/3 of the values in a lognormal distribution lie between GM/GSD and GM×GSD.
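The interval from GM/GSD to GM×GSD can be computed directly from the logarithms. This sketch reuses the five values from the weighted-statistics cell and, like the cell above, uses `ddof=0` for the SD of the logs; `lower` and `upper` are illustrative names.

```python
import numpy as np
from scipy import stats

values = np.array([90, 80, 75, 100, 85])
log_values = np.log10(values)

gm = 10 ** log_values.mean()   # geometric mean
gsd = 10 ** log_values.std()   # geometric SD (unitless factor, ddof=0)

# Multiplicative analogue of "mean minus 1 SD to mean plus 1 SD"
lower, upper = gm / gsd, gm * gsd

print(np.isclose(gm, stats.gmean(values)))  # True
print(f"GM = {gm:.2f}, GSD = {gsd:.3f}, range = [{lower:.2f}; {upper:.2f}]")
```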

The log of the product of two values equals the sum of their logs, so taking logarithms converts multiplicative scatter (lognormal distribution) into additive scatter (Gaussian distribution). Lognormal distributions are common, e.g. the potency of drugs (EC50, IC50, Km, Ki, etc.) or the blood serum concentrations of many natural or toxic compounds.
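A small simulation can make this concrete. The sketch below checks the log identity on two arbitrary numbers, then draws a lognormal sample and compares its skewness before and after the log transform. The sample parameters (`mean=0`, `sigma=1`, 10,000 draws) and the seed are assumptions chosen only for illustration.

```python
import numpy as np
from scipy import stats

# log(a * b) = log(a) + log(b)
a, b = 3.0, 7.0
print(np.isclose(np.log10(a * b), np.log10(a) + np.log10(b)))  # True

# Hypothetical lognormal sample: strongly right-skewed on the raw scale
rng = np.random.default_rng(seed=0)
sample = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

skew_raw = stats.skew(sample)          # clearly positive
skew_log = stats.skew(np.log(sample))  # close to 0, as for a Gaussian

print(f"skewness raw: {skew_raw:.2f}, after log: {skew_log:.2f}")
```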

### Weighted statistics

In [15]:
from statsmodels.stats.weightstats import DescrStatsW

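array = [90, 80, 75, 100, 85]
weights = [0.3, 0.2, 0.15, 0.15, 0.2]  # normalized weights, they sum to 1

# The variance is divided by (sum of weights - ddof). With weights summing to 1
# and ddof=1 this denominator is 0, hence the `inf` below; use ddof=0
# (or frequency weights, i.e. counts) to get finite values.
weighted_stats = DescrStatsW(array, weights=weights, ddof=1)
print(weighted_stats.mean, weighted_stats.std, weighted_stats.var, sep='\n')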

86.25
inf
inf
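The `inf` values arise because the weights sum to 1, so the denominator "sum of weights minus ddof" vanishes when `ddof=1`. One way to obtain finite values with normalized weights is the population form (equivalent to `ddof=0`), sketched here with NumPy only; `w_mean`, `w_var` and `w_std` are illustrative names, and this is not necessarily the estimator the original cell intended.

```python
import numpy as np

values = np.array([90, 80, 75, 100, 85])
weights = np.array([0.3, 0.2, 0.15, 0.15, 0.2])  # sum to 1

# Weighted mean
w_mean = np.average(values, weights=weights)

# Weighted variance without degrees-of-freedom correction (ddof=0)
w_var = np.average((values - w_mean) ** 2, weights=weights)
w_std = np.sqrt(w_var)

print(w_mean, w_std, w_var, sep='\n')
```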
