# Probability distributions

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
data = pd.read_csv('aust_athletes_data.csv')
data.head()

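| Unnamed: 0 | rcc | wcc | hc | hg | ferr | bmi | ssf | pcBfat | lbm | ht | wt | sex | sport |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.96 | 7.5 | 37.5 | 12.3 | 60 | 20.56 | 109.1 | 19.75 | 63.32 | 195.9 | 78.9 | f | B_Ball |
| 1 | 4.41 | 8.3 | 38.2 | 12.7 | 68 | 20.67 | 102.8 | 21.3 | 58.55 | 189.7 | 74.4 | f | B_Ball |
| 2 | 4.14 | 5.0 | 36.4 | 11.6 | 21 | 21.86 | 104.6 | 19.88 | 55.36 | 177.8 | 69.1 | f | B_Ball |
| 3 | 4.11 | 5.3 | 37.3 | 12.6 | 69 | 21.88 | 126.4 | 23.66 | 57.18 | 185.0 | 74.9 | f | B_Ball |
| 4 | 4.45 | 6.8 | 41.5 | 14.0 | 29 | 18.96 | 80.3 | 17.64 | 53.2 | 184.6 | 64.6 | f | B_Ball |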


In [6]:
from scipy.stats import norm
from scipy.stats import probplot
from scipy.stats import skew

In [7]:
# Checking Skew
print("Skewness of height parameter : ", skew(data["ht"]))
print("Skewness of weight parameter : ", skew(data["wt"]))

Skewness of height parameter :  -0.1993027980729341
Skewness of weight parameter :  0.24060527967495085


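`Here, height is slightly left-skewed (negative skewness) whereas weight is slightly right-skewed (positive skewness). Both values are close to 0, so neither variable is far from symmetric.`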
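
To make the skewness values above less of a black box, here is a minimal sketch of the moment-based formula that `scipy.stats.skew` uses by default (`bias=True`): the third central moment divided by the second central moment raised to the power 1.5. The helper name `sample_skewness` and the small array are hypothetical and are not taken from the athletes data.

```python
import numpy as np
from scipy.stats import skew


def sample_skewness(values):
    """Moment-based (biased) skewness: m3 / m2**1.5."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)  # second central moment
    m3 = np.mean(centered ** 3)  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical sample with one large value on the right (illustration only)
sample = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 10.0])

manual = sample_skewness(sample)
library = skew(sample)  # default bias=True uses the same formula

print("manual :", manual)
print("scipy  :", library)
```

A positive result indicates a longer right tail and a negative result a longer left tail, which is how the height and weight values above are read.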

### Types of Probability distribution

`Depending on whether the random variable is discrete or continuous, a probability distribution is either discrete or continuous. We are going to discuss the following probability distributions.`

- Discrete probability distribution

 1. Discrete uniform

 2. Binomial distribution

- Continuous probability distribution

 1. Continuous uniform

 2. Normal distribution

 3. Lognormal distribution

Now let's try to answer the questions below.

- What % of athletes have height <= 165 cm?
- What % of athletes have height between 165 and 185 cm?
- What % of athletes have height > 185 cm?

In [8]:
# Empirical fraction of athletes with height <= 165 cm
len(data[data['ht'] <= 165])/len(data)

0.0594059405940594

In [20]:
from scipy.stats import norm

In [19]:
# % of athletes with height <= 165 cm, from a normal distribution fitted to the sample mean and std
print(norm.cdf(165, loc = data["ht"].mean(), scale = data["ht"].std()))

0.06037998011893779


In [17]:
# % of athletes with height between 165 and 185 cm
print(norm.cdf(185, loc = data["ht"].mean(), scale = data["ht"].std()) - norm.cdf(165, loc = data["ht"].mean(), scale = data["ht"].std()))

0.6321230376276139


In [18]:
# % of athletes with height > 185 cm
print(1 - norm.cdf(185, loc = data["ht"].mean(), scale = data["ht"].std()))

0.30749698225344835
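
The three calculations above repeat the same mean and standard deviation arguments. As a sketch of a tidier pattern, the fitted normal distribution can be frozen once and reused. The data below are synthetic (drawn with an assumed mean of 180 and standard deviation of 10) and only stand in for `data["ht"]`; the variable names are hypothetical.

```python
import numpy as np
from scipy.stats import norm

# Synthetic stand-in for the height column (assumption, not the real data)
rng = np.random.default_rng(0)
heights = rng.normal(loc=180.0, scale=10.0, size=5000)

# Freeze a normal distribution with the sample mean and sample std (ddof=1,
# which matches the default of pandas' Series.std())
model = norm(loc=heights.mean(), scale=heights.std(ddof=1))

p_low = model.cdf(165)                    # P(X <= 165)
p_mid = model.cdf(185) - model.cdf(165)   # P(165 < X <= 185)
p_high = model.sf(185)                    # P(X > 185), same as 1 - cdf(185)

# Empirical fraction for comparison with the model-based value
emp_low = np.mean(heights <= 165)

print(p_low, p_mid, p_high)
print("empirical P(X <= 165):", emp_low)
```

The three model probabilities sum to 1, and comparing `p_low` with `emp_low` mirrors the comparison between cells 8 and 19 above.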


### Normality test
Normality tests are used to determine whether the data are normally distributed, or equivalently whether the sample plausibly comes from a normally distributed population.
There are various graphical and numeric tests to determine this.

- Graphical tests

 1. Histogram/density plot

 2. Q-Q plot

- Numeric tests

 1. Shapiro-Wilk test

 2. Kolmogorov-Smirnov test

### a. QQ Plot
A Q-Q plot is a graphical method for comparing two probability distributions by plotting their quantiles against each other. For a normality test, one distribution is that of the sample being tested and the other is the standard normal distribution; if the sample is approximately normal, the points fall close to a straight line. Built-in functions for drawing Q-Q plots are available in the statsmodels and scipy packages, and the plot can also be built manually.
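
As a minimal sketch of the scipy route, `scipy.stats.probplot` returns the theoretical quantiles, the ordered sample values, and a least-squares line through them. The sample below is synthetic (an assumption for illustration) and stands in for `data["ht"]`.

```python
import numpy as np
from scipy.stats import probplot

# Synthetic, roughly normal sample standing in for the height column
rng = np.random.default_rng(1)
sample = rng.normal(loc=180.0, scale=10.0, size=300)

# Without a plot argument, probplot only computes the Q-Q points and the fit
(theoretical_q, ordered_values), (slope, intercept, r) = probplot(sample, dist="norm")

print("slope (approx. std)     :", slope)
print("intercept (approx. mean):", intercept)
print("correlation r           :", r)

# To draw the plot, pass a matplotlib target, for example:
#   import matplotlib.pyplot as plt
#   probplot(sample, dist="norm", plot=plt)
#   plt.show()
```

A correlation `r` close to 1 means the points lie near the fitted line, which is the visual signature of approximate normality.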

### b. Shapiro-Wilk test
It is a numeric test to check whether a sample is normally distributed or not. It is a hypothesis-based test where the null and alternate hypotheses are defined as below -

H0 (Null Hypothesis) - Sample is normally distributed

H1 (Alternate Hypothesis) - Sample is not normally distributed

Thus, if the p value obtained for the W statistic is less than the significance level (α), the null hypothesis is rejected. On the other hand, if the p value is greater than α, we fail to reject the null hypothesis.

In [22]:
from scipy.stats import shapiro
print(shapiro(data['ht'])[1])

0.21208299696445465


`Here, for α = 0.05, the obtained p value (0.2121) > α, so we fail to reject the null hypothesis, i.e. the data give no evidence that height departs from a normal distribution.`
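
The decision rule can be wrapped in a small helper, sketched below. The function name `shapiro_decision` and both samples are hypothetical; the samples are synthetic and serve only to show the test reacting to a clearly non-normal input.

```python
import numpy as np
from scipy.stats import shapiro


def shapiro_decision(values, alpha=0.05):
    """Run the Shapiro-Wilk test and apply the p value < alpha rule."""
    w_stat, p_value = shapiro(values)
    return {
        "W": float(w_stat),
        "p_value": float(p_value),
        "reject_normality": bool(p_value < alpha),
    }


rng = np.random.default_rng(2)
normal_like = rng.normal(size=200)   # drawn from a normal distribution
skewed = rng.exponential(size=200)   # strongly right-skewed

print(shapiro_decision(normal_like))
print(shapiro_decision(skewed))
```

Failing to reject the null hypothesis does not prove normality; it only means the sample does not contradict it at the chosen significance level.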

### c. Kolmogorov–Smirnov (K-S) test
The K-S test provides a way to -

- check whether a sample is drawn from a reference probability distribution or not (one-sample K–S test)

- check whether two samples are drawn from the same distribution or not (two-sample K–S test)

It is a hypothesis-based test where the null and alternate hypotheses for the one-sample K–S test are defined as below -

- H0 (Null Hypothesis) - Sample follows the reference distribution

- H1 (Alternate Hypothesis) - Sample does not follow the reference distribution

In [25]:
from scipy.stats import kstest
print(kstest(data['ht'], 'norm', args = (data['ht'].mean(), data['ht'].std())))

KstestResult(statistic=0.04556908744140109, pvalue=0.7956300304012434)


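`Here, for α = 0.05, the obtained p value (0.7956) > α, so we fail to reject the null hypothesis, i.e. the data are consistent with height following a normal distribution. Note that the mean and standard deviation were estimated from the same sample, which makes the standard K-S p value conservative; the Lilliefors variant of the test accounts for this.`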
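
The two-sample variant mentioned above is not demonstrated in this notebook. A minimal sketch using `scipy.stats.ks_2samp` follows; all three samples are synthetic and the one-standard-deviation shift is an arbitrary choice for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(3)
a = rng.normal(loc=180.0, scale=10.0, size=300)
b = rng.normal(loc=180.0, scale=10.0, size=300)   # same distribution as a
c = rng.normal(loc=190.0, scale=10.0, size=300)   # mean shifted by one std

same = ks_2samp(a, b)      # H0: a and b come from the same distribution
shifted = ks_2samp(a, c)   # H0: a and c come from the same distribution

print("a vs b:", same.statistic, same.pvalue)
print("a vs c:", shifted.statistic, shifted.pvalue)
```

The statistic is the largest vertical gap between the two empirical CDFs, so the shifted pair should show a much larger statistic and a very small p value.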

### 3. Binomial distribution
The binomial distribution is a discrete probability distribution for obtaining exactly  k  successes out of  n  Bernoulli trials.

Characteristics of the Bernoulli trials in a binomial experiment -

- Each trial has only two possible outcomes - success and failure.
- The total number of trials is fixed.
- The probability of success (and of failure) remains the same throughout all the trials.
- The trials are independent of each other.

The PMF of a binomial random variate is given as

$$P(X = k) = \binom{n}{k} p^{k} (1 - p)^{n - k}$$

where p = probability of success and (1-p) = probability of failure

k = number of successes and (n-k) = number of failures
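
To connect the formula to the library call, here is a small sketch that evaluates the PMF directly with `math.comb` and compares it with `scipy.stats.binom.pmf`. The helper name `binomial_pmf` is hypothetical; the values n = 50 and p = 0.1237 are the ones used in the basketball example in this notebook.

```python
from math import comb

from scipy.stats import binom


def binomial_pmf(k, n, p):
    """P(X = k) = C(n, k) * p**k * (1 - p)**(n - k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


n, p = 50, 0.1237

manual = binomial_pmf(2, n, p)
library = binom.pmf(2, n, p)
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))

print("manual  P(X = 2):", manual)
print("library P(X = 2):", library)
print("sum over all k  :", total)
```

Summing the PMF over k = 0, ..., n gives 1 up to floating-point error, which is a quick sanity check on the formula.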

In [27]:
# In the underlying dataset, 12.37% (25/202) of athletes play basketball.
# Suppose we draw a random sample of 50 athletes.

# What is the probability that exactly two athletes play basketball?
from scipy.stats import binom
n = 50      # fixed number of trials
p = 0.1237  # probability of success
print("probability that exactly two athletes play basketball : ", binom.pmf(2, n, p))


probability that exactly two athletes play basketball :  0.033129161897993045


In [28]:
# What is the probability that at most 10 athletes play basketball?
print("probability that at most 10 athletes play basketball : ", binom.cdf(10, n, p))

probability that at most 10 athletes play basketball :  0.9605911166949369


In [29]:
# What is the probability that at least 20 athletes play basketball?
# P(X >= 20) = P(X > 19), i.e. the survival function at 19.
# np.round without a decimals argument rounds to the nearest integer, so this tiny probability prints as 0.0.
print("probability that at least 20 athletes play basketball : ", np.round(binom.sf(19, n, p)))

probability that at least 20 athletes play basketball :  0.0


In [None]:
# All the above questions ask about a number of successes (2, 10, 20) out of a fixed number of trials (50) with p = 0.1237, so the binomial distribution can be used to answer them.
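
The "at most" and "at least" questions map onto the CDF and the survival function. The sketch below, using the same n = 50 and p = 0.1237, checks `binom.sf(19, n, p)` against an explicit sum of PMF terms; the variable names are hypothetical.

```python
from scipy.stats import binom

n, p = 50, 0.1237

# P(X <= 10): cumulative distribution function
at_most_10 = binom.cdf(10, n, p)

# P(X >= 20) = P(X > 19): survival function evaluated at 19
at_least_20 = binom.sf(19, n, p)

# The same tail probability as an explicit sum of PMF terms
via_sum = sum(binom.pmf(k, n, p) for k in range(20, n + 1))

print("P(X <= 10):", at_most_10)
print("P(X >= 20):", at_least_20)
print("tail by sum:", via_sum)
```

Printing the unrounded tail shows that the probability is very small rather than exactly zero, and `sf` avoids the loss of precision that `1 - cdf` can suffer for tiny tails.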

### 4. Uniform distributions
Based on the type of random variable there are two types of uniform distributions.

- Discrete uniform distribution for discrete random variable
- Continuous uniform distribution for continuous random variable
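
As a brief sketch of both cases in scipy: `scipy.stats.randint(low, high)` is a discrete uniform distribution on the integers low, ..., high - 1, and `scipy.stats.uniform(loc, scale)` is a continuous uniform distribution on [loc, loc + scale]. The fair die and the interval [2, 5] are hypothetical examples chosen for illustration.

```python
from scipy.stats import randint, uniform

# Discrete uniform: a fair six-sided die, values 1..6 (high is exclusive)
die = randint(low=1, high=7)
p_three = die.pmf(3)          # each face has probability 1/6
p_at_most_two = die.cdf(2)    # P(X <= 2) = 2/6

# Continuous uniform on [2, 5]: loc is the left end, scale is the width
u = uniform(loc=2.0, scale=3.0)
density = u.pdf(3.0)          # constant density 1/3 inside the interval
p_below_3_5 = u.cdf(3.5)      # (3.5 - 2) / 3 = 0.5

print(p_three, p_at_most_two)
print(density, p_below_3_5)
```

In the discrete case probabilities come from the PMF, while in the continuous case the PDF is a density and probabilities come from differences of the CDF.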

### 5. LogNormal distribution
A random variable X is said to be lognormally distributed if the natural logarithm of X is normally distributed. In other words, X ∼ LogNormal(μ, σ) if log(X) is normally distributed.
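
The definition can be checked numerically. The sketch below assumes that μ and σ denote the mean and standard deviation of log(X), which corresponds to `s=sigma` and `scale=exp(mu)` in `scipy.stats.lognorm`; the values μ = 1.0 and σ = 0.5 are arbitrary.

```python
import numpy as np
from scipy.stats import lognorm, norm

mu, sigma = 1.0, 0.5   # assumed mean and std of log(X)
x = 3.0

# P(X <= x) from the lognormal distribution itself
cdf_lognorm = lognorm.cdf(x, s=sigma, scale=np.exp(mu))

# The same probability from the normal distribution of log(X)
cdf_via_log = norm.cdf(np.log(x), loc=mu, scale=sigma)

# Simulated draws: the log of lognormal samples should look normal(mu, sigma)
rng = np.random.default_rng(4)
samples = rng.lognormal(mean=mu, sigma=sigma, size=10000)
log_mean = np.log(samples).mean()
log_std = np.log(samples).std(ddof=1)

print(cdf_lognorm, cdf_via_log)
print(log_mean, log_std)
```

Because the logarithm is monotonic, P(X <= x) equals P(log(X) <= log(x)), which is why the two CDF values agree.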

PDF of a log normally distributed random variable is given as