## Distributions

In [1]:
import numpy as np
from scipy import stats

In [4]:
myDf = stats.norm(5,3)

In [10]:
X = np.linspace(-5,15,101)
y = myDf.cdf(X)
print(y)

[  4.29060333e-04   5.44108652e-04   6.87137938e-04   8.64165209e-04
   1.08230048e-03   1.34989803e-03   1.67671823e-03   2.07409836e-03
   2.55513033e-03   3.13484226e-03   3.83038057e-03   4.66118802e-03
   5.64917276e-03   6.81886227e-03   8.19753592e-03   9.81532863e-03
   1.17052981e-02   1.39034475e-02   1.64486958e-02   1.93827871e-02
   2.27501319e-02   2.65975740e-02   3.09740757e-02   3.59303191e-02
   4.15182197e-02   4.77903523e-02   5.47992917e-02   6.25968728e-02
   7.12333774e-02   8.07566592e-02   9.12112197e-02   1.02637252e-01
   1.15069670e-01   1.28537149e-01   1.43061192e-01   1.58655254e-01
   1.75323945e-01   1.93062337e-01   2.11855399e-01   2.31677575e-01
   2.52492538e-01   2.74253118e-01   2.96901429e-01   3.20369191e-01
   3.44578258e-01   3.69441340e-01   3.94862910e-01   4.20740291e-01
   4.46964883e-01   4.73423536e-01   5.00000000e-01   5.26576464e-01
   5.53035117e-01   5.79259709e-01   6.05137090e-01   6.30558660e-01
   6.55421742e-01   6.79630809e-01

### Discrete Distributions
Two discrete distributions are frequently encountered: the ___binomial distribution___ and the ___Poisson distribution___.

- **Binomial distribution**: It has an inherent upper limit. Example: when you throw a die five times, each side can come up at most *five times*.

- **Poisson distribution**: It has no upper limit. Example: the number of people you know.


## Bernoulli Distribution
The simplest case of a _univariate distribution_, and also the _basis of the binomial distribution_, is the Bernoulli distribution which has only *two states*.

Example: If we flip a coin (and the coin is not rigged) ***once***, the chance that “heads” comes up is P(heads) = 0.5. Since the outcome has to be heads or tails, we must have P(heads) + P(tails) = 1.



In [14]:
p = 0.5
bernoulliDist = stats.bernoulli(p)

# the outcome is discrete, so we use the PMF rather than a PDF
p_tails = bernoulliDist.pmf(0)  # probability mass function
p_heads = bernoulliDist.pmf(1)

# draw 10 Bernoulli trials

trials = bernoulliDist.rvs(10) # random variate sample
print(trials)


[1 1 1 1 0 1 1 0 1 1]


## Binomial distribution
It has an inherent upper limit. The binomial distribution is associated with the question “Out of a given (fixed) number of trials, how many will succeed?”.

Example: If we flip a coin ___multiple times___ and ask “How often did heads come up?”, we have the binomial distribution. A few other examples are:

- Out of ten tosses, how many times will this coin land heads?
- From the children born in a given hospital on a given day, how many of them will be girls?
- How many students in a given classroom will have green eyes?
- How many mosquitoes, out of a swarm, will die when sprayed with insecticide?
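
The first question in the list can be answered with a few lines of `scipy.stats`. This is a minimal sketch that assumes a fair coin (p = 0.5); the variable names are chosen here for illustration only.

```python
from scipy import stats

n_tosses = 10      # fixed number of trials
p_heads = 0.5      # assumed success probability (fair coin)
coin = stats.binom(n_tosses, p_heads)

# Probability of exactly 7 heads in 10 tosses
p_seven = coin.pmf(7)

# Probability of at most 3 heads
p_at_most_three = coin.cdf(3)

# The support is bounded: more than n_tosses successes is impossible
p_eleven = coin.pmf(11)

print(p_seven, p_at_most_three, p_eleven)
```

The last value illustrates the inherent upper limit: the probability of more successes than trials is zero.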


## Poisson distribution

The Poisson distribution is similar to the binomial distribution. The difference is:
- The binomial distribution looks at how many times we register a success over a fixed number of trials.
- The Poisson distribution looks at how many times an event occurs over a given period of time.
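
The comparison above can be sketched in code. The rate of 3 events per interval is an assumption made for illustration. The last two lines use the known limit that a binomial distribution with many trials and a small success probability approaches a Poisson distribution with rate n·p.

```python
from scipy import stats

rate = 3.0   # assumed average number of events per time interval
events = stats.poisson(rate)

p_none = events.pmf(0)             # no event in the interval
p_more_than_five = events.sf(5)    # P(X > 5); there is no upper limit
print(p_none, p_more_than_five)

# Binomial with many trials and small p, compared with Poisson(n * p)
n, p = 10_000, rate / 10_000
binom_p2 = stats.binom(n, p).pmf(2)
poisson_p2 = events.pmf(2)
print(binom_p2, poisson_p2)
```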

## Normal or Gaussian distribution
It is the most important of all the distribution functions. The _standard normal distribution_ is a normal distribution with
a mean of zero and a standard deviation of one, and is sometimes referred to as *z-distribution*.

**Examples**
- If the average man is 175 cm tall with a standard deviation of 6 cm, what is the probability that a man selected at random will be 183 cm tall?
- If cans are assumed to have a standard deviation of 4 g, what does the average weight need to be in order to ensure that 99 % of all cans have a weight of at least 250 g?
- If the average man is 175 cm tall with a standard deviation of 6 cm, and the average woman is 168 cm tall with a standard deviation of 3 cm, what is the probability that a randomly selected man will be shorter than a randomly selected woman?
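
The three questions above can be sketched as follows. Two assumptions are made here that the text does not state: the first question is read as "at least 183 cm tall" (for a continuous distribution, a single exact value has probability zero), and the heights of the man and the woman are taken to be independent.

```python
import numpy as np
from scipy import stats

# Example 1, read as P(height >= 183 cm)
men = stats.norm(175, 6)
p_tall = men.sf(183)

# Example 2: choose the mean so that P(weight >= 250 g) = 0.99 with sigma = 4 g
sigma_can = 4
required_mean = 250 - sigma_can * stats.norm.ppf(0.01)

# Example 3: the difference (man - woman) of two independent normal
# variables is normal, with the variances adding up
diff = stats.norm(175 - 168, np.sqrt(6**2 + 3**2))
p_man_shorter = diff.cdf(0)

print(p_tall, required_mean, p_man_shorter)
```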

In [2]:
import numpy as np
from scipy import stats

mu = -2
sigma = 0.7
myDistribution = stats.norm(mu,sigma)
significance_level = 0.05    # given as a probability (5 %)

# the PPF (inverse of the CDF) returns the value for a given cumulative probability
myDistribution.ppf([significance_level/2, 1-significance_level/2] )

array([-3.37197479, -0.62802521])

## Central Limit Theorem
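The **mean** of a sufficiently large number of independent, identically distributed random samples is approximately ___normally distributed___, even if the underlying distribution is not normal.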
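
A small simulation can illustrate this. The sketch below assumes an exponential distribution as the non-normal source; the seed, sample size, and number of repetitions are arbitrary choices, not values from the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # fixed seed, chosen arbitrarily
sample_size = 50
n_repeats = 10_000

# Draw from a clearly non-normal (exponential) distribution with mean 1, std 1
samples = rng.exponential(scale=1.0, size=(n_repeats, sample_size))
sample_means = samples.mean(axis=1)

# Theory: the means cluster around 1 with standard deviation 1/sqrt(sample_size)
mean_of_means = sample_means.mean()
std_of_means = sample_means.std(ddof=1)
print(mean_of_means, std_of_means, 1 / np.sqrt(sample_size))

# The skewness of the sample means is much smaller than that of the raw data
skew_raw = stats.skew(samples.ravel())
skew_means = stats.skew(sample_means)
print(skew_raw, skew_means)
```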

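## Student's t-distribution
When the *population mean* and *variance* are unknown and have to be estimated from sample data, the sample mean, normalized by its *standard error (SE)*, follows a *t-distribution* rather than a normal distribution (for normally distributed data).

The **t-distribution** converges towards the *normal distribution* as the number of degrees of freedom grows.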
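
The convergence can be checked numerically. This sketch compares two-sided 95 % critical values; the list of degrees of freedom is an arbitrary choice for illustration.

```python
from scipy import stats

alpha = 0.05
norm_crit = stats.norm.isf(alpha / 2)

# Two-sided 95 % critical values of the t-distribution for growing degrees of freedom
dofs = [2, 5, 20, 100, 1000]
t_crits = [stats.t(dof).isf(alpha / 2) for dof in dofs]

for dof, t_crit in zip(dofs, t_crits):
    print(f'df = {dof:5d}: t = {t_crit:.4f}  (normal: {norm_crit:.4f})')
```

The critical values decrease monotonically and approach the normal value from above.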

    A very frequent application of the t-distribution is in the calculation of confidence intervals for the mean. The width of the 95 %-confidence interval (CI), i.e., the interval that contains the true mean with a chance of 95 %, is the same width about the population mean that contains 95 % of the sample means.
    
The following example shows how to calculate the t-values for the 95 %-CI, for n = 20. The lower end of the 95 %-CI is the value that is larger than 2.5 % of the distribution; and the upper end of the 95 %-CI is the value that is larger than 97.5 % of the distribution. These values can be obtained either with the percent point function (PPF), or with the inverse survival function (ISF).
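
As a quick check of the last statement, the sketch below computes both ends of the interval with the PPF and shows that the ISF returns the same upper value. The variable names are chosen here for illustration.

```python
from scipy import stats

n = 20
alpha = 0.05
t_dist = stats.t(n - 1)

lower = t_dist.ppf(alpha / 2)          # larger than 2.5 % of the distribution
upper_ppf = t_dist.ppf(1 - alpha / 2)  # larger than 97.5 % of the distribution
upper_isf = t_dist.isf(alpha / 2)      # same point, expressed via the survival function

print(lower, upper_ppf, upper_isf)
```

Because the t-distribution is symmetric about zero, the lower value is the negative of the upper one.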

In [27]:
import numpy as np
from scipy import stats
n = 20
df = n-1 # degrees of freedom
alpha = 0.05 

# critical t-value for a sample of just 20 values
t_val = stats.t(df).isf(alpha/2)

# corresponding critical value of the normal distribution
norm_val = stats.norm.isf(alpha/2)

print('t-distribution = {0}\nnormal-distribution = {1}'.format(t_val,norm_val))

# 95% confidence interval for mean 
data = np.arange(0,100)
confidence = 0.95   # confidence level, not the significance level alpha
new_df = len(data)-1
ci = stats.t.interval(confidence, new_df,loc=np.mean(data), scale=stats.sem(data))
print('confidence interval = ',ci)
print('confidence interval = ',ci)

t-distribution = 2.0930240544082634
normal-distribution = 1.9599639845400545
confidence interval =  (43.743490583289677, 55.256509416710323)


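We can see that, for n = 20, the t-value (≈ 2.09) is close to, but slightly larger than, the normal-distribution value (≈ 1.96). For larger samples, **t-distribution ≈ normal distribution**.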

**Advantage**
- Because of its heavier tails, the t-distribution is much more robust against outliers than the normal distribution.

## Chi-Square Distribution
The chi-square distribution is related to the normal distribution in a simple way. If a random variable X has a *standard normal distribution*, then X^2 has a *chi-square distribution* with one degree of freedom.

The sum of squares of n *independent* *standard normal random variables* has a **chi-square distribution** with *n degrees of freedom*.
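
This relation can be illustrated with a simulation. In the sketch below, the seed, n = 5, the number of repetitions, and the threshold are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)   # arbitrary seed
n = 5                            # number of standard normal variables summed
n_repeats = 20_000

z = rng.standard_normal(size=(n_repeats, n))
sum_of_squares = (z**2).sum(axis=1)

# A chi-square distribution with n degrees of freedom has mean n and variance 2n
sim_mean = sum_of_squares.mean()
sim_var = sum_of_squares.var(ddof=1)
print(sim_mean, sim_var)

# Compare an empirical tail probability with the theoretical one
threshold = 9.0
empirical_tail = (sum_of_squares > threshold).mean()
theoretical_tail = stats.chi2(n).sf(threshold)
print(empirical_tail, theoretical_tail)
```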

**Application Example**

    A pill producer is ordered to deliver pills with a standard deviation of σ = 0.05. From the next batch of pills, n = 13 random samples have a weight of 3.04, 2.94, 3.01, 3.00, 2.94, 2.91, 3.02, 3.04, 3.09, 2.95, 2.99, 3.10, 3.02 g.
    
**Question**: Is the standard deviation of the produced pills larger than allowed?

**Answer**: Since the chi-square distribution describes the distribution of the summed squares of random variates from a standard normal distribution, we have to normalize our data before we calculate the corresponding CDF value.
 

In [31]:
import numpy as np
from scipy import stats

data = np.r_[3.04, 2.94, 3.01, 3.00, 2.94, 2.91, 3.02,
3.04, 3.09, 2.95, 2.99, 3.10, 3.02]
sigma = 0.05

chi2Dist = stats.chi2(len(data)-1) # degrees of freedom = n-1
statistic = np.sum(((data-np.mean(data))/sigma)**2)

# probability of a statistic at least this large: survival function (1 - CDF)
probability = chi2Dist.sf(statistic)

print(probability)

0.192933066543


**Interpretation**

    If the batch of pills is from a distribution with a standard deviation = 0.05, the likelihood of obtaining a chi-square value as large or larger than the one observed is about 19 %, so it is not atypical. In other words, the batch matches the expected standard deviation.
    
The chance of obtaining a chi-square value at least as large as the observed one is about **19 %**. Hence the produced pills match the expected standard deviation.

## F-Distribution
It is used in determining critical values in ANOVAs (“ANalysis Of VAriance”). If we want to investigate whether two groups have the same variance, we calculate the ratio of the sample variances (the squared sample standard deviations) and compare it with the F-distribution.

**Application Example**

    Take for example the case where we want to compare the precision of two methods to measure eye movements. The two methods can have different accuracy and different precision. As shown in Fig. 6.17, the accuracy gives the deviation between the real and the measured value, while the precision is determined by the variance of the measurements. With the test we want to determine if the precision of the two methods is equivalent, or if one method is more precise than the other.
    
When you look 20 degrees to the right, you get the following results:
- Method 1: [20.7, 20.3, 20.3, 20.3, 20.7, 19.9, 19.9, 19.9, 20.3, 20.3, 19.7, 20.3]
- Method 2: [ 19.7, 19.4, 20.1, 18.6, 18.8, 20.2, 18.7, 19. ]


In [37]:
import numpy as np
from scipy import stats

method1 = np.array([20.7, 20.3, 20.3, 20.3, 20.7, 19.9,
19.9, 19.9, 20.3, 20.3, 19.7, 20.3])
method2 = np.array([ 19.7, 19.4, 20.1, 18.6, 18.8, 20.2,
18.7, 19. ])


fval = np.var(method1, ddof=1)/np.var(method2, ddof=1)  # ddof=1 gives the sample variance
fd = stats.f(len(method1)-1,len(method2)-1)
p_oneTail = fd.cdf(fval) # cdf of f-distribution

if (p_oneTail<0.025) or (p_oneTail>0.975):
    print('p-tail = {}, which is not between 0.025 and 0.975'.format(p_oneTail))
    print('\nHence, there is a significant difference'
' between the two variances.')
else:
    print('No significant difference.')

p-tail = 0.018665169931411433, which is not between 0.025 and 0.975

Hence, there is a significant difference between the two variances.


**Interpretation**

    The F statistic is F = 0.244, and has n − 1 and m − 1 degrees of freedom, where n and m are the number of recordings with each method. The code sample above shows that the F statistic is in the tail of the distribution (p_oneTail = 0.019), so we reject the hypothesis that the two methods have the same precision.

## Other distributions

- **Lognormal distribution**: A normal distribution, plotted on an exponential scale. A logarithmic transformation of the data is often used to convert a strongly skewed distribution into a normal one.

- **Weibull distribution**: Mainly used for reliability or survival data.
- **Exponential distribution**: Describes the waiting time between events that occur independently at a constant average rate.
- **Uniform distribution**: When everything is equally likely.
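
The remark about the logarithmic transformation can be illustrated with simulated data. This is a minimal sketch; the seed, the lognormal parameters, and the sample size are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)   # arbitrary seed
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=5_000)

# The raw data are strongly right-skewed; their logarithm is normally distributed
skew_raw = stats.skew(skewed)
skew_log = stats.skew(np.log(skewed))
print(skew_raw, skew_log)
```

A skewness near zero after the transformation is consistent with a symmetric, normal-like distribution.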