## Python statistical libraries

+ **Statistics** is a built-in Python library for descriptive statistics. It is a good fit when the dataset is not too large or when only **basic statistical operations** are needed.
+ **Numpy** is a third-party library for **numerical computation and optimized n-dimensional array operations**. It has many built-in statistical functions.
+ **Scipy** is a third-party library for **scientific computation** built on top of Numpy. It offers more scientific and statistical functions than Numpy.

In [1]:
import math
import statistics
import numpy as np
import scipy.stats
import pandas as pd

### Not a Number (nan) Value

When a dataset has missing values, they are typically represented by the not a number (nan) value. <br>
In data science, missing values are common, and we often replace them with nan.  <br>
Python offers many techniques for imputing missing values, such as filling them with the mean, median, or mode.

In [2]:
# In Python, we can use any of the following:

math.nan
np.nan

nan

In [3]:
# Define list with nan values

x = [1, 2, np.nan, 4, math.nan]
x

[1, 2, nan, 4, nan]
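
Because nan does not compare equal to anything, including itself, it cannot be found with `==`. The sketch below uses only the standard library to show one way to detect nan values and fill them with the mean of the remaining values. The names `clean`, `fill_value`, and `imputed` are illustrative assumptions, not part of any library.

```python
import math
import statistics

x = [1, 2, math.nan, 4, math.nan]

# nan never compares equal, not even to itself
print(math.nan == math.nan)  # False

# math.isnan() is the reliable way to detect nan
clean = [v for v in x if not math.isnan(v)]

# Mean imputation: replace each nan with the mean of the non-nan values
fill_value = statistics.mean(clean)
imputed = [fill_value if math.isnan(v) else v for v in x]
print(imputed)
```

Mean imputation is only one option; the median or mode can be substituted in the same way by changing how `fill_value` is computed.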

### Measures of Central Tendency

Measures of central tendency describe the central or middle values of a dataset. <br>
There are several measures of central tendency: <br>
+ Mean
+ Median
+ Mode
+ Weighted mean
+ Geometric mean
+ Harmonic mean

#### 1. Mean
The sample mean, also called the **sample arithmetic mean or simply the average**, is the arithmetic average of all the items in a dataset. The mean of a dataset 𝑥 is mathematically expressed as **Σᵢ𝑥ᵢ/𝑛**, where 𝑖 = 1, 2, …, 𝑛. In other words, it’s the sum of all the elements 𝑥ᵢ divided by the number of items in the dataset 𝑥.

![image.png](attachment:image.png)
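
As a minimal sketch of the formula above, the mean can be computed by hand as the sum of the elements divided by their count. The helper name `mean_by_hand` is an assumption for illustration and is not part of any library.

```python
import statistics

def mean_by_hand(data):
    """Return sum(x_i) / n for a non-empty sequence of numbers."""
    n = len(data)
    if n == 0:
        raise ValueError("mean requires at least one data point")
    return sum(data) / n

x = [1, 2, 3, 4, 5]
print(mean_by_hand(x))       # 3.0
print(statistics.fmean(x))   # 3.0
```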

In [4]:
# NumPy array
x = np.array([1, 2, 3, 4, 5])

# Statistics mean --> statistics.mean(x)
print("Statistics mean:",statistics.mean(x)) 

# Numpy mean --> np.mean(x)
print("Numpy mean:",np.mean(x))

# When the list contains a nan value, the mean is also nan
x = [1, 2, 3, 4, np.nan, 5, 6, math.nan]

# Statistics mean with nan value --> statistics.mean(x)
print("Statistics mean with nan value:",statistics.mean(x)) 

# Numpy mean --> np.mean(x)
print("Numpy mean with nan value:",np.mean(x))

Statistics mean: 3
Numpy mean: 3.0
Statistics mean with nan value: nan
Numpy mean with nan value: nan


In [5]:
# We can ignore nan values when computing the mean. Of the two libraries above, only Numpy provides a function for this.
# np.nanmean() skips the nan values in the list when calculating the average
print("Numpy nanmean with nan value:",np.nanmean(x))

Numpy nanmean with nan value: 3.5


In [6]:
# The Pandas library also provides a mean() function
# It can be applied only to pandas objects such as a Series or DataFrame columns
# By default, the pandas mean() function ignores nan values (skipna=True)
x = pd.Series(x)
print("Pandas mean with nan value:",x.mean())

Pandas mean with nan value: 3.5


#### 2. Median
The median is the **middle element of the sorted dataset**. The dataset can be sorted in either ascending or descending order. If the dataset has an **odd number of elements**, the median is the element at zero-based index **(n-1)/2**. If the dataset has an **even number of elements**, the median is the **mean of the two middle elements, at zero-based indices n/2 - 1 and n/2**.

For example, if you have the data points 2, 4, 1, 8, and 9, then the median value is 4, which is in the middle of the sorted dataset (1, 2, 4, 8, 9). If the data points are 2, 4, 1, and 8, then the median is 3, which is the average of the two middle elements of the sorted sequence (2 and 4). The following figure illustrates this:

![image.png](attachment:image.png)

The data points are the green dots, and the purple lines show the median for each dataset. The median value for the upper dataset (1, 2.5, 4, 8, and 28) is 4. If you remove the outlier 28 from the lower dataset, then the median becomes the arithmetic average between 2.5 and 4, which is 3.25. <br>
The figure below shows both the mean and median of the data points 1, 2.5, 4, 8, and 28:

![image.png](attachment:image.png)

The **main difference between the behavior of the mean and median is related to dataset outliers**. The **mean is heavily affected by outliers, but the median is affected only slightly, or not at all.**
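
The effect can be checked on the data points from the figures above. This sketch uses only the built-in statistics module and compares both measures with and without the outlier 28; the variable names are illustrative.

```python
import statistics

with_outlier = [1, 2.5, 4, 8, 28]
without_outlier = [1, 2.5, 4, 8]

# The mean moves a lot when the outlier is removed
mean_with = statistics.mean(with_outlier)
mean_without = statistics.mean(without_outlier)

# The median moves much less
median_with = statistics.median(with_outlier)
median_without = statistics.median(without_outlier)

print("Mean:", mean_with, "->", mean_without)
print("Median:", median_with, "->", median_without)
```

Here the mean drops from 8.7 to 3.875, while the median only changes from 4 to 3.25.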

In [7]:
# Pure Python implementation

x = np.array([1,3,4,2,5,7,6,8])
n = len(x)
if n%2 != 0:
    # Odd length: take the single middle element
    median_ = sorted(x)[(n-1)//2]
else:
    # Even length: average the two middle elements
    x_sort, idx = sorted(x), n//2
    median_ = 0.5 * (x_sort[idx-1] + x_sort[idx])
median_

4.5

In [8]:
# Median calculation using Statistics library
print("Median using statistics function:",statistics.median(x))

#Median calculation using Numpy library
print("Median using Numpy function:",np.median(x))

Median using statistics function: 4.5
Median using Numpy function: 4.5


In [9]:
# Median low & high: the statistics library can return the lower or the higher of the two middle values when the length is even
# These functions are available only in the statistics library
print("Median low using statistics function:",statistics.median_low(x))
print("Median high using statistics function:",statistics.median_high(x))

Median low using statistics function: 4
Median high using statistics function: 5


In [10]:
x = np.array([1,2,3,np.nan,5,6,math.nan])

# Statistics library: Median with nan value
# Note: statistics.median() sorts the data, so its result with nan values is not reliable
print("Median with nan using statistics function:",statistics.median(x))

# Numpy library: Median with nan value
print("Median with nan using Numpy median function:",np.median(x))
print("Median with nan using Numpy nanmedian function:",np.nanmedian(x))

# Pandas library: Median with nan value
# It can be applied only to pandas objects such as a Series or DataFrame columns
# By default, the pandas median() function ignores nan values (skipna=True)
x = pd.Series(x)
print("Median with nan using Pandas's function:",x.median())

Median with nan using statistics function: nan
Median with nan using Numpy median function: nan
Median with nan using Numpy nanmedian function: 3.0
Median with nan using Pandas's function: 3.0


#### 3. Mode
The sample mode is the **value in the dataset that occurs most frequently**. A dataset can have more than one mode. For example, in the set that contains the points 2, 3, 2, 8, 3, and 12, the numbers 2 and 3 are both modes because each occurs twice, unlike the other items, which occur only once.

In [11]:
# Pure Python implementation (returns a single mode; ties resolve to the largest value)
x = [2, 3, 2, 8, 12]
mode_ = max((x.count(i), i) for i in set(x))[1]
print("Mode:", mode_)

Mode: 2


In [None]:
# The statistics library can return either a single mode or all modes (multimode)
x = [2, 3, 2, 8, 12]
y = [2, 3, 2, 8, 3, 12]
print("Single Mode value:", statistics.mode(x))
print("Multiple Mode value:", statistics.multimode(y))

In [None]:
statistics.multimode('aabbbbccddddeeffffgg')
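
For comparison with the pure Python single-mode version above, here is a minimal sketch that collects every mode using `collections.Counter`. The helper name `all_modes` is hypothetical; in the standard library, `statistics.multimode` already provides this behavior.

```python
from collections import Counter
import statistics

def all_modes(data):
    """Return all values with the highest count, in first-seen order."""
    counts = Counter(data)
    if not counts:
        return []
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

y = [2, 3, 2, 8, 3, 12]
print(all_modes(y))                       # [2, 3]
print(all_modes('aabbbbccddddeeffffgg'))  # ['b', 'd', 'f']
```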