# Fundamentals of statistics in Python
## Calculating descriptive statistics of a single variable

***
<br>

## Preparation of the working environment

1. Importing the statistics packages

In [1]:
import math
import statistics
import numpy as np
import scipy.stats
import pandas as pd

2. Creation of some data to work with
* Lists `x` and `x_with_nan` are almost the same; the only difference is that `x_with_nan` contains a `nan` value. It’s important to understand how the Python statistics routines behave when they come across a not-a-number value (`nan`).

In [2]:
x = [8.0, 1, 2.5, 4, 28.0]
x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]

3. Creation of `np.ndarray` and `pd.Series` objects that correspond to `x` and `x_with_nan`

In [3]:
y, y_with_nan = np.array(x), np.array(x_with_nan)
z, z_with_nan = pd.Series(x), pd.Series(x_with_nan)

## Measures of central tendency

* The measures of central tendency show the central or middle values of datasets. 

#### Mean

* The __sample mean__ is also called the __sample arithmetic mean__ or simply the __average__.
* It is the arithmetic average of all the items in a dataset.
* In other words, it’s the sum of all the elements in the dataset divided by the number of items in the dataset.
* You can calculate the mean with pure Python using `sum()` and `len()` or using a function from one of the libraries.

In [4]:
pure_mean = sum(x)/len(x)
print(pure_mean)

8.7


In [5]:
st_mean = statistics.mean(x)
print(st_mean)

8.7


In [6]:
np_mean = np.mean(y)
print(np_mean)

8.7


In [7]:
pd_mean = z.mean()
print(pd_mean)

8.7


* What about a dataset containing `nan`?
    * the `statistics` package does not return a usable result: `statistics.mean()` returns `nan`
    * the `numpy` package provides a dedicated function, `np.nanmean()`, which ignores `nan` values
    * `pandas` skips `nan` values by default

In [8]:
statistics.mean(x_with_nan)

nan

In [9]:
np.mean(y_with_nan)

nan

In [10]:
np.nanmean(y_with_nan)

8.7

In [11]:
z_with_nan.mean()

8.7
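
* If you prefer to stay within the standard library, one option is to drop the `nan` values yourself before calling `statistics.mean()`. The cell below is a minimal sketch of that idea; the names `clean` and `mean_without_nan` are illustrative and not part of the original notebook.

```python
import math
import statistics

x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]

# nan is never equal to itself, so test for it with math.isnan()
clean = [item for item in x_with_nan if not math.isnan(item)]

# The mean is now computed over the five remaining values
mean_without_nan = statistics.mean(clean)
print(mean_without_nan)
```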

#### Weighted mean

* The __weighted mean__ is also called the __weighted arithmetic mean__ or __weighted average__.
* It is a generalization of the arithmetic mean that enables you to define the relative contribution of each data point to the result.

In [12]:
x = [8.0, 1, 2.5, 4, 28.0]
weights = [0.1, 0.2, 0.3, 0.25, 0.15]

wmean = sum(x_ * w_ for (x_, w_) in zip(x, weights)) / sum(weights)
print(wmean)

6.95


In [13]:
y, z, weights = np.array(x), pd.Series(x), np.array(weights)

wmean = np.average(y, weights=weights)
print(wmean)

6.95
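
* The same result can also be obtained from the element-wise product of the data and the weights, which works for both `np.ndarray` and `pd.Series` objects. This is a sketch under the assumption that the weights are stored in a NumPy array; the names `w`, `wmean_np`, and `wmean_pd` are illustrative.

```python
import numpy as np
import pandas as pd

x = [8.0, 1, 2.5, 4, 28.0]
w = np.array([0.1, 0.2, 0.3, 0.25, 0.15])
y, z = np.array(x), pd.Series(x)

# Multiply each value by its weight, sum the products,
# then divide by the total weight
wmean_np = (y * w).sum() / w.sum()
wmean_pd = (z * w).sum() / w.sum()
print(wmean_np, wmean_pd)
```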


#### Harmonic mean

* The __harmonic mean__ is the reciprocal of the mean of the reciprocals of all items in the dataset.

In [14]:
hmean = len(x) / sum(1 / item for item in x)
print(hmean)

2.7613412228796843


In [15]:
hmean = statistics.harmonic_mean(x)
print(hmean)

2.7613412228796843


In [16]:
scipy.stats.hmean(y), scipy.stats.hmean(z)

(2.7613412228796843, 2.7613412228796843)

#### Geometric mean

* The __geometric mean__ is the 𝑛-th root of the product of all elements in a dataset. 

In [17]:
gmean = 1
for item in x:
    gmean *= item
gmean **= 1 / len(x)
print(gmean)

4.677885674856041


In [18]:
gmean = statistics.geometric_mean(x)
print(gmean)

4.67788567485604


In [19]:
scipy.stats.gmean(y), scipy.stats.gmean(z)

(4.67788567485604, 4.67788567485604)
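
* For long datasets the running product in the pure Python version can overflow or underflow. A common alternative, sketched below for strictly positive data, is to average the logarithms and exponentiate the result. The name `log_gmean` is illustrative.

```python
import math

x = [8.0, 1, 2.5, 4, 28.0]

# exp(mean(log(x))) equals the n-th root of the product
# when every element is positive
log_gmean = math.exp(sum(math.log(item) for item in x) / len(x))
print(log_gmean)
```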

#### Median

* The __sample median__ is the middle element of a sorted dataset.
* The dataset can be sorted in increasing or decreasing order.
* If the number of elements of the dataset is odd, then the median is the value at the middle position.
* If the number of elements of the dataset is even, then the median is the arithmetic mean of the two values in the middle.

In [20]:
n = len(x)
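if n % 2:
    # odd number of elements: take the single middle value
    pure_median = sorted(x)[(n - 1) // 2]
else:
    # even number of elements: average the two middle values
    x_ord, index = sorted(x), n // 2
    pure_median = 0.5 * (x_ord[index-1] + x_ord[index])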
print(pure_median)

4


In [21]:
st_median = statistics.median(x)
print(st_median)

4


In [22]:
np.median(y)

4.0

In [23]:
z.median()

4.0

* `median_low()` and `median_high()` are two more functions related to the median in the Python `statistics` library.
* They always return an element from the dataset:
    * If the number of elements is odd, then there’s a single middle value, so these functions behave just like `median()`.
    * If the number of elements is even, then there are two middle values. In this case, `median_low()` returns the lower and `median_high()` the higher middle value.

In [24]:
statistics.median_low(x[:-1]), statistics.median_high(x[:-1])

(2.5, 4)

#### Mode

* The __sample mode__ is the value in the dataset that occurs most frequently.
* If there isn’t a single such value, then the set is multimodal since it has multiple modal values.

In [25]:
u = [2, 3, 2, 8, 12]
pure_mode = max((u.count(item), item) for item in set(u))[1]
print(pure_mode)

2


In [26]:
statistics.mode(u)

2

In [27]:
statistics.multimode(u)

[2]

In [28]:
v = [12, 15, 12, 15, 21, 15, 12]
# In Python 3.8+, statistics.mode(v) returns the first mode encountered (12);
# older versions raised StatisticsError for multimodal data
statistics.multimode(v)

[12, 15]

In [29]:
u, v = np.array(u), np.array(v)
scipy.stats.mode(u)

ModeResult(mode=array([2]), count=array([2]))

In [30]:
# If there are multiple modal values in the dataset, then only the smallest value is returned.
scipy.stats.mode(v)

ModeResult(mode=array([12]), count=array([3]))
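
* `pandas` can also compute the mode. `pd.Series.mode()` returns all modal values as a new `Series`, so multimodal data is handled without losing information. A short sketch, with the data redefined so that the cell is self-contained (the names `u_modes` and `v_modes` are illustrative):

```python
import pandas as pd

u = pd.Series([2, 3, 2, 8, 12])
v = pd.Series([12, 15, 12, 15, 21, 15, 12])

# mode() returns every modal value, sorted in ascending order
u_modes = u.mode().tolist()
v_modes = v.mode().tolist()
print(u_modes, v_modes)
```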

## Measures of variability

* The measures of central tendency aren’t sufficient to describe data.
* You’ll also need the measures of variability that quantify the spread of data points.

#### Variance

* The __sample variance__ quantifies the spread of the data.
* It shows numerically how far the data points are from the mean.

In [31]:
n = len(x)
pure_mean = sum(x) / n
pure_var = sum((item - pure_mean)**2 for item in x) / (n - 1)
print(pure_var)

123.19999999999999


In [32]:
statistics.variance(x)

123.2

In [33]:
np.var(y, ddof=1)

123.19999999999999

* It’s very important to specify the parameter `ddof=1` in the `np.var` function. That’s how you set the __delta degrees of freedom__ to 1. With this setting the sample variance is calculated with `(𝑛 − 1)` in the denominator instead of `𝑛`, which is what `np.var` uses by default (`ddof=0`).

In [34]:
z.var(ddof=1)

123.19999999999999
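
* To see the effect of `ddof`, the sketch below compares the sample variance (`ddof=1`) with the population variance (`ddof=0`, the NumPy default). `statistics.pvariance()` is the standard-library counterpart for the population variance. The variable names are illustrative.

```python
import statistics
import numpy as np

x = [8.0, 1, 2.5, 4, 28.0]
y = np.array(x)

sample_var = np.var(y, ddof=1)    # divides by n - 1
population_var = np.var(y)        # ddof=0 by default, divides by n
st_population_var = statistics.pvariance(x)
print(sample_var, population_var, st_population_var)
```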

#### Standard Deviation

* The __sample standard deviation__ is another measure of data spread.
* It’s connected to the sample variance: the standard deviation is the positive square root of the sample variance.
* The standard deviation is often more convenient than the variance because it has the same unit as the data points.

In [35]:
pure_std = pure_var**0.5
print(pure_std)

11.099549540409285


In [36]:
statistics.stdev(x)

11.099549540409287

In [37]:
np.std(y, ddof=1)

11.099549540409285

In [38]:
z.std(ddof=1)

11.099549540409285
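
* The `nan` behavior described for the mean carries over to the measures of variability. The following sketch, which redefines the data with a `nan` value so that it is self-contained, shows `np.std()` propagating `nan`, `np.nanstd()` ignoring it, and `pandas` skipping it by default. The result names are illustrative.

```python
import math
import numpy as np
import pandas as pd

x_with_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]
y_with_nan, z_with_nan = np.array(x_with_nan), pd.Series(x_with_nan)

std_plain = np.std(y_with_nan, ddof=1)       # nan propagates to the result
std_nan_aware = np.nanstd(y_with_nan, ddof=1)  # nan values are ignored
std_pandas = z_with_nan.std(ddof=1)          # pandas skips nan by default
print(std_plain, std_nan_aware, std_pandas)
```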

#### Percentiles

* The sample `𝑝` __percentile__ is the element in the dataset such that `𝑝%` of the elements in the dataset are less than or equal to that value. Also, `(100 − 𝑝)%` of the elements are greater than or equal to that value.
* If there are two such elements in the dataset, then the sample `𝑝` percentile is their arithmetic mean.
* Each dataset has three __quartiles__, which are the percentiles that divide the dataset into four parts.

In [39]:
x = [-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0]
statistics.quantiles(x, n=2)

[8.0]

In [40]:
statistics.quantiles(x, n=4, method='inclusive')

[0.1, 8.0, 21.0]

In [41]:
y = np.array(x)
np.percentile(y, 5), np.percentile(y, 95)

(-3.44, 34.919999999999995)

In [42]:
np.percentile(y, [25, 50, 75])

array([ 0.1,  8. , 21. ])

In [43]:
np.quantile(y, [0.25, 0.5, 0.75])

array([ 0.1,  8. , 21. ])

In [44]:
z = pd.Series(y)
z.quantile([0.25, 0.5, 0.75])

0.25     0.1
0.50     8.0
0.75    21.0
dtype: float64
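
* Note that the 5th percentile returned by `np.percentile()` is not an element of the dataset. By default NumPy interpolates linearly between the two neighboring sorted values. The sketch below reproduces that calculation by hand; `percentile_linear` is a hypothetical helper written for illustration, not a library function.

```python
import math
import numpy as np

x = [-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0]

def percentile_linear(data, p):
    """Hypothetical helper: p-th percentile using linear interpolation."""
    ordered = sorted(data)
    position = (len(ordered) - 1) * p / 100    # fractional index into the sorted data
    lower = math.floor(position)
    upper = min(lower + 1, len(ordered) - 1)   # stay inside the list for p = 100
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

manual_p5 = percentile_linear(x, 5)
numpy_p5 = np.percentile(x, 5)
print(manual_p5, numpy_p5)
```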

#### Ranges

* The __range of data__ is the difference between the maximum and minimum element in the dataset.
* The __interquartile range__ is the difference between the third and first quartiles.

In [45]:
np.ptp(y)

46.0

In [46]:
y.max() - y.min()

46.0

In [47]:
quartiles = np.quantile(y, [0.25, 0.75])
quartiles[1] - quartiles[0]

20.9
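
* SciPy also offers a ready-made function for the interquartile range, `scipy.stats.iqr()`. A brief sketch, with the data redefined so that the cell runs on its own (the name `iqr_value` is illustrative):

```python
import numpy as np
import scipy.stats

y = np.array([-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0])

# By default iqr() uses the 25th and 75th percentiles
iqr_value = scipy.stats.iqr(y)
print(iqr_value)
```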

## --- Exercise ---

Using the appropriate functions from the `numpy` library, determine the values of the arithmetic mean, median, quartiles, variance and standard deviation for the given sample of data.

In [None]:
import numpy as np

data_sample = np.array([10, 15, 17, 90, 9.5, 12.7, 13.2, 88])

# write your code here