
In [1]:
import numpy as np
from scipy import stats
import statistics

# Basics

**Descriptive Statistics** 
 - It summarizes a dataset, which helps in gaining insights and making inferences about the data.
 - In general, we compute statistical measures on one or more samples drawn from a population and use them to draw conclusions about that population.
 - Descriptive statistics involves estimating measures of centrality and measures of dispersion.

**Centrality Measures**
 - They locate the center of a dataset.
 - The three major centrality measures are: mean, median, and mode.

**Mean**
 - It is the sum of all values in a dataset divided by the total number of values.
 - The mean function of numpy computes the mean of an array of numbers.

**Median**
 - It is the value that separates the sorted dataset into two halves; for an even number of values, it is the average of the two middle values.
 - The median function of numpy computes the median of a dataset.

**Mode**
 - It is the value that appears most often in a dataset.
 - The mode function of the scipy.stats module computes the mode of a dataset.
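
As a minimal sketch of the three centrality measures defined above, the standard library's statistics module can compute them on a small list. The sample values and variable names here are illustrative assumptions, not part of the original notebook.

```python
import statistics

# Hypothetical sample; 47 appears twice, so it is the unique mode.
scores = [86, 47, 45, 47, 40]

mean_value = statistics.mean(scores)      # sum of values / number of values
median_value = statistics.median(scores)  # middle value of the sorted data
mode_value = statistics.mode(scores)      # most frequent value

# The mean matches the definition computed by hand.
manual_mean = sum(scores) / len(scores)

print(mean_value, median_value, mode_value, manual_mean)
```

The numpy and scipy.stats functions named above should give the same values on this sample; the standard library is used here only to keep the sketch dependency-free.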

**Measures of Dispersion**
 - They describe the spread of a dataset.
 - Major measures of dispersion are: range, percentile, inter-quartile range, standard deviation, variance, skewness, and kurtosis.


**Range**
 - It is the difference between the maximum and minimum values of a dataset.
 - The ptp ("peak to peak") function of numpy computes it: for s1 = np.array([86, 47, 45, 47, 40]), np.ptp(s1) returns 46.

**Percentile**
 - It is a value below which a given percentage of the data points lie.
 - E.g., the 45th percentile is the value below which 45% of the data points are found.
 - The percentile function of numpy computes a single percentile or several at once.
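
The following sketch shows np.percentile called with a single percentile and with a list of percentiles. The data (the integers 1 to 10) is an assumed toy sample; with NumPy's default linear interpolation, a percentile can fall between two data points.

```python
import numpy as np

# Hypothetical sample: the integers 1..10
data = np.arange(1, 11)

# A single percentile returns one value
p45 = np.percentile(data, 45)

# A list of percentiles returns an array with one value per percentile
p_multi = np.percentile(data, [25, 50, 75])

print(p45)
print(p_multi)
```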

**Quartiles**
 - The three quartiles, Q1, Q2 and Q3, split the sorted dataset into four equal parts.
 - Each part contains 25% of the data.

**Inter Quartile Range (IQR)**
 - It is the difference between the third quartile (Q3) and the first quartile (Q1).
 - The iqr function of scipy.stats computes it.
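
To connect quartiles and the IQR, this sketch computes Q1, Q2 and Q3 with np.percentile, takes Q3 - Q1 by hand, and compares the result with stats.iqr. The toy data is an assumption for illustration; both functions are used with their default (linear) interpolation.

```python
import numpy as np
from scipy import stats

# Hypothetical sample: the integers 1..10
data = np.arange(1, 11)

# Quartiles are the 25th, 50th and 75th percentiles
q1, q2, q3 = np.percentile(data, [25, 50, 75])

iqr_manual = q3 - q1          # definition: Q3 - Q1
iqr_scipy = stats.iqr(data)   # same quantity computed by scipy.stats

print(q1, q2, q3)
print(iqr_manual, iqr_scipy)
```

Note that Q2 coincides with the median of the data.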

**Variance**
 - It is the average of the squared differences between each data point and the dataset's mean.

**Standard Deviation**
 - It is the square root of the variance.
 - The var and std functions of numpy compute the variance and standard deviation, respectively.
 - By default, both functions assume that the dataset represents the entire population.
 - To treat the data as a sample drawn from a population, set the ddof (delta degrees of freedom) parameter to 1.
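
A short sketch of the effect of ddof, using an assumed toy dataset: with ddof=0 (the default) the sum of squared deviations is divided by n, and with ddof=1 it is divided by n - 1.

```python
import numpy as np

# Hypothetical data: mean is 5, sum of squared deviations is 32
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# Population statistics (default ddof=0): divide by n
pop_var = np.var(data)
pop_std = np.std(data)

# Sample statistics (ddof=1): divide by n - 1
sample_var = np.var(data, ddof=1)
sample_std = np.std(data, ddof=1)

print(pop_var, pop_std)
print(sample_var, sample_std)
```

The sample variance is always at least as large as the population variance computed on the same numbers, because the divisor is smaller.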

**Skewness**
 - It measures the asymmetry of a distribution, i.e., whether the data has a longer tail on one side of the mean.
 - A positive value indicates a right-skewed distribution, a negative value a left-skewed one, and zero a symmetric (unskewed) distribution.
 - The skew function of the scipy.stats module computes it.
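
The sign convention can be checked with stats.skew on three small, made-up samples: one with a long right tail, its mirror image, and a symmetric one. The data values are assumptions chosen only to make the tails obvious.

```python
from scipy import stats

# Hypothetical samples
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]    # one large value: long right tail
left_skewed = [-v for v in right_skewed]     # mirror image: long left tail
symmetric = [1, 2, 3, 4, 5]                  # evenly spread around the mean

skew_right = stats.skew(right_skewed)
skew_left = stats.skew(left_skewed)
skew_sym = stats.skew(symmetric)

print(skew_right, skew_left, skew_sym)
```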

**Kurtosis**
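 - Kurtosis describes the shape of a probability distribution: how much of the data sits in the tails relative to the region around the mean.
 - It can be estimated using the kurtosis function of the scipy.stats module.
 - By default, it uses Fisher’s definition, under which a normal distribution has kurtosis 0. Setting the fisher parameter to False switches to Pearson's definition, under which a normal distribution has kurtosis 3.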
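
As a small sketch of the two definitions, the call below evaluates stats.kurtosis with and without the fisher flag on the ten-value sample used in this notebook's example cell. The two results differ by exactly 3, since Fisher's definition subtracts 3 from Pearson's.

```python
from scipy import stats

# Same ten values as the notebook's example sample
s = [26, 15, 8, 44, 26, 13, 38, 24, 17, 29]

k_fisher = stats.kurtosis(s)                 # Fisher (excess) kurtosis, the default
k_pearson = stats.kurtosis(s, fisher=False)  # Pearson kurtosis

print(k_fisher, k_pearson, k_pearson - k_fisher)
```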

In [2]:
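s = [26, 15, 8, 44, 26, 13, 38, 24, 17, 29]
print(f"Mean is {np.mean(s)}")
print(f"Median is {np.median(s)}")
# NOTE: the [0][0] indexing relies on older SciPy releases, where mode returns arrays;
# newer releases return scalars for 1-D input, so use stats.mode(s).mode there.
print(f"Mode is {stats.mode(s)}, {stats.mode(s)[0][0]}")
# By default, percentiles that fall between two data points are linearly interpolated.
print(f"with linear interpolation (default) 1st and 3rd quartiles are {np.percentile(s,[25,75])}")
print(f"with linear interpolation (default) IQR through scipy.stats is {stats.iqr(s, rng=(25,75))}")
# interpolation='lower' picks the nearest data point below instead of interpolating
# (newer NumPy releases spell this keyword method='lower' in np.percentile).
print(f"with interpolation='lower' 1st and 3rd quartiles are {np.percentile(s,[25,75], interpolation='lower')}")
print(f"with interpolation='lower' IQR through scipy.stats is {stats.iqr(s, rng=(25,75), interpolation='lower')}")
print(f"Skewness is {round(stats.skew(s),2)}")
print(f"Kurtosis is {round(stats.kurtosis(s),2)}")

Mean is 24.0
Median is 25.0
Mode is ModeResult(mode=array([26]), count=array([2])), 26
with linear interpolation (default) 1st and 3rd quartiles are [15.5  28.25]
with linear interpolation (default) IQR through scipy.stats is 12.75
with interpolation='lower' 1st and 3rd quartiles are [15 26]
with interpolation='lower' IQR through scipy.stats is 11
Skewness is 0.36
Kurtosis is -0.77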


# Random Variables, Random Numbers and Random Distributions

A random number is a number chosen by chance from a distribution.
Python provides several modules that deal with random numbers:
- the random module of the Python standard library
- the random module of NumPy
- the stats module of SciPy
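
As a rough side-by-side sketch, each of the three modules listed above can draw one uniformly distributed number. The variable names are illustrative, and no seed is set, so the printed values differ on every run.

```python
import random

import numpy as np
from scipy import stats

a = random.random()       # Python standard library: float in [0.0, 1.0)
b = np.random.rand()      # NumPy: float in [0.0, 1.0)
c = stats.uniform.rvs()   # SciPy: one draw from the uniform distribution on [0, 1]

print(a, b, c)
```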

The random module of numpy has utilities that generate arrays of random numbers.
- E.g.: the rand function generates uniformly distributed numbers from the half-open interval [0, 1).
- Called with no arguments, rand generates a single random value.
- Called with dimensions as arguments, it generates a random array of the specified shape.

In [3]:
import numpy as np
print(np.random.rand())
# generates a 2*3 array
print(np.random.rand(2,3))

0.1534063485247239
[[0.08962388 0.04007866 0.04149615]
 [0.6763176  0.65476664 0.90561844]]


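- In statistics, items are often selected at random from a population, either with or without replacement.
- This can be done with the choice function of numpy's random module; its replace parameter controls whether an item can be selected more than once.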

In [4]:
import numpy as np
print(np.random.choice([11, 22, 33], 2, replace=False))

[22 11]
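
For contrast, here is a minimal sketch of sampling with and without replacement from the same population. The seed value and sample sizes are arbitrary assumptions.

```python
import numpy as np

population = [11, 22, 33]
np.random.seed(0)  # arbitrary seed, only to make the run repeatable

# Without replacement: each item can be picked at most once,
# so drawing 3 items returns a permutation of the population.
without_replacement = np.random.choice(population, 3, replace=False)

# With replacement: items can repeat, so the sample may be
# larger than the population.
with_replacement = np.random.choice(population, 5, replace=True)

print(without_replacement)
print(with_replacement)
```

Requesting more items than the population holds with replace=False raises a ValueError.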


**Random Seeding**
- A seed matters for reproducibility. If you work with random numbers and want peers to validate your results, i.e., to get the same random sequence as you did, set the seed to a particular value and share that value with them.
- A seed is a number that sets the initial state of the random number generator.
- Setting the same seed generates the same sequence of random numbers every time.
- The seed function of numpy's random module sets the seed, as shown in the example below.

In [5]:
import numpy as np

np.random.seed(100)
print(np.random.rand())

np.random.seed(100)
print(np.random.rand())

0.5434049417909654
0.5434049417909654


**Random Variables**
 - In probability theory, the set of all possible outcomes of a random experiment is known as the sample space.
 - The probabilities of all outcomes of the experiment define the probability distribution.
 - A random variable is a function that maps each outcome of the sample space to a number (an integer or a real value).
 - E.g.: In the experiment of tossing a coin, the sample space is {'Head', 'Tail'}, and one possible random variable takes the value 0 for head and 1 for tail.
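
The coin example can be sketched in plain Python: the sample space is a list of outcomes, and the random variable is a mapping from each outcome to a number. The names sample_space and random_variable, the seed, and the number of tosses are illustrative assumptions.

```python
import random

sample_space = ['Head', 'Tail']
# The random variable X assigns a number to each outcome: X(Head) = 0, X(Tail) = 1
random_variable = {'Head': 0, 'Tail': 1}

random.seed(7)  # arbitrary seed for a repeatable run

# Run the experiment 10 times, then map each outcome to its numeric value
outcomes = [random.choice(sample_space) for _ in range(10)]
values = [random_variable[outcome] for outcome in outcomes]

print(outcomes)
print(values)
```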

**Probability Distributions**
 - There are two types of probability distributions, discrete and continuous, whose random variables take countable (typically integer) and real values, respectively.
 - The scipy.stats module provides objects that represent random variables for a large number of probability distributions.
 - E.g.: norm represents a normal continuous random variable, and binom represents a binomial discrete random variable.

- These random variable objects share a common set of methods for discrete and continuous distributions.
- Commonly used methods are:
  - pdf / pmf : Probability density function (continuous) or probability mass function (discrete).
  - cdf : Cumulative distribution function.
  - sf : Survival function (1 - cdf).
  - rvs : Draws random samples from a distribution.
- The following example defines a normal continuous random variable with mean 1.0 and standard deviation 2.5.
- It also evaluates the probability density and the cumulative probability at -1, 0 and 1.
- The example also generates six random numbers (a 2x3 array) from the defined normal distribution.

In [6]:
from scipy import stats

x = stats.norm(loc=1.0, scale=2.5)

print(x.pdf([-1, 0, 1]))

print(x.cdf([-1, 0, 1]))

print(x.rvs((2,3)))

[0.11587662 0.14730806 0.15957691]
[0.2118554  0.34457826 0.5       ]
[[-0.40409689 -3.12269376  1.88668613]
 [-0.96516083  0.42031951  1.5199392 ]]
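
The cell above does not exercise sf or pmf, so here is a brief sketch of both. It reuses the same normal distribution for sf, and adds an assumed binomial variable (10 trials, success probability 0.5) to illustrate pmf.

```python
import numpy as np
from scipy import stats

# Same normal random variable as above: mean 1.0, standard deviation 2.5
x = stats.norm(loc=1.0, scale=2.5)
points = [-1, 0, 1]

cdf_vals = x.cdf(points)  # P(X <= point)
sf_vals = x.sf(points)    # P(X > point), i.e. 1 - cdf

# Hypothetical discrete variable: number of successes in 10 fair trials
coin = stats.binom(n=10, p=0.5)
pmf_5 = coin.pmf(5)       # P(exactly 5 successes) = C(10, 5) / 2**10

print(cdf_vals + sf_vals)
print(pmf_5)
```

Discrete distributions such as binom expose pmf rather than pdf, while cdf, sf and rvs are available on both kinds.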


In [7]:
# create a normal distribution with mean 32 and standard deviation 4.5
# set the random seed to 1, and create a random sample of 100 elements from the above distribution
# compute the absolute difference between the sample mean and the distribution mean

import numpy as np
from scipy import stats
np.random.seed(1)

x = stats.norm.rvs(loc=32, scale=4.5, size=100)
print(round(abs(np.mean(x)-32),2))

0.27
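
As a hedged follow-up to the exercise above, the sketch below repeats the comparison for several sample sizes. The sizes and the use of the random_state argument are assumptions made for illustration; the general expectation is that the gap between the sample mean and the distribution mean tends to shrink as the sample grows, roughly in proportion to 1/sqrt(n), though any single run can deviate from that trend.

```python
import numpy as np
from scipy import stats

true_mean, true_std = 32, 4.5
errors = {}

for size in (100, 10_000, 1_000_000):
    # random_state makes each draw reproducible without touching the global seed
    sample = stats.norm.rvs(loc=true_mean, scale=true_std, size=size, random_state=1)
    errors[size] = abs(np.mean(sample) - true_mean)
    print(f"size={size:>9}: |sample mean - distribution mean| = {errors[size]:.4f}")
```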


In [8]:
# Simulate a random experiment of tossing a coin 10000 times and determine the count of Heads.
# Hint: Define a binomial distribution with n = 1 and p = 0.5

import numpy as np
from scipy import stats
np.random.seed(1)

x = stats.binom(1, 0.5)
y = x.rvs(10000)
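# Heads are encoded as 0 (as in the coin example above), so index 0 of bincount holds the count of Heads
print(np.bincount(y)[0])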

4990


# Hypothesis testing with scipy