## Binomial Distribution


#### 1. From [StatQuest](https://www.youtube.com/watch?v=J8jNoF-K8E8)

In [77]:
import pandas as pd
import numpy as np
# from math import comb #python needs updating

Do people like orange and grape Fanta equally?

The binomial distribution models what to expect when there is no preference (difference) between two choices (outcomes).

* If there is no preference, there is a 50% chance of either choice.
* In a string of independent outcomes, the probability of observing that exact sequence is the product of the individual probabilities.

In [3]:
O = 'O'
G = 'G'
prob = {'O':0.5,'G':0.5}

In [4]:
obs = [O,O,G]
obs

['O', 'O', 'G']

In [10]:
prob_of_event = np.prod([prob[_] for _ in obs])
prob_of_event

0.125

`0.125` is the probability of asking 3 people their preference and them saying `O`, `O`, and `G` in that order.

It is **not** the probability that 2/3 people will prefer `O` over `G`.

The same two `O`s and one `G` could also have arrived as `[O,G,O]` or `[G,O,O]`.
Each of these three orderings has probability `0.125`, so the total probability that *2/3 prefer O over G* is `0.125 x 3 = 0.375`.
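The counting argument above can be checked by brute force. The following is a minimal sketch that lists every possible sequence of 3 answers, keeps those with exactly two `O`s, and adds up their probabilities; the names `seq_prob`, `sequences`, and `two_o` are made up for this illustration.

```python
from itertools import product

prob = {'O': 0.5, 'G': 0.5}

def seq_prob(seq):
    # Probability of one exact sequence: product of the individual probabilities
    result = 1.0
    for choice in seq:
        result *= prob[choice]
    return result

# Every possible sequence of 3 answers (2**3 = 8 of them)
sequences = list(product('OG', repeat=3))

# Keep only the sequences in which exactly 2 people chose O
two_o = [seq for seq in sequences if seq.count('O') == 2]

total = sum(seq_prob(seq) for seq in two_o)

print(two_o)  # [('O', 'O', 'G'), ('O', 'G', 'O'), ('G', 'O', 'O')]
print(total)  # 0.375
```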


In [47]:
import math

def binom(n, k):
    '''
    Number of ways to choose which k of the n observations are successes,
    e.g. if n=3 and k=1:
    1. [1,0,0]
    2. [0,1,0]
    3. [0,0,1]
    '''
    return math.factorial(n) // math.factorial(k) // math.factorial(n - k)

# The first term is the number of ways the result
# can be obtained in a sequence.
# The second term weights the probability of success (p to the power of x, the successes).
# The third term weights the probability of failure (1-p to the power of n-x, the failures).
def p_binom(x,n,p):
    return binom(n,x)  * (p**x) * ((1-p) **(n-x))

In [48]:
# probability of exactly x successes in n trials, given success probability p
n = 3
x = 1
p = 0.5
p_binom(x,n,p)

0.375
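As a sanity check on the formula, the probabilities over every possible number of successes (0 to n) should add up to 1. A minimal sketch, re-defining the two functions so that it runs on its own; the name `pmf` is only used for this illustration.

```python
import math

def binom(n, k):
    # Number of ways to choose which k of the n observations are successes
    return math.factorial(n) // (math.factorial(k) * math.factorial(n - k))

def p_binom(x, n, p):
    # Probability of exactly x successes in n trials with success probability p
    return binom(n, x) * (p ** x) * ((1 - p) ** (n - x))

n, p = 3, 0.5
pmf = {x: p_binom(x, n, p) for x in range(n + 1)}

print(pmf)                # {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}
print(sum(pmf.values()))  # 1.0
```

The two middle entries match: with `p = 0.5`, one success out of three is as likely as two out of three.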

### Developing the intuition with simulations

In [95]:
import random

def generateRandomSample(n = 3):
    # draw n independent answers, with O and G equally likely
    return random.choices([O,G], weights=[0.5,0.5], k=n)

In [60]:
### success is defined as exactly 1 of 3 people preferring G (equivalently, 2 of 3 preferring O)

In [99]:
from collections import Counter
def is_success(ransample,success='G',x=1):
    return Counter(ransample)[success] == x

### What is the fraction of times the desired event is observed?

def get_prob_simulation(n=3,success='G',x=1,n_iters = 10000):
    count_success=0
    for i in range(0,n_iters):
        output = generateRandomSample(n)
        if is_success(output,success,x):
            count_success += 1
    return count_success/n_iters        

In [100]:
get_prob_simulation()

0.3678

In [101]:
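### Repeat many times and get average
# n_iters must be passed by keyword: a positional 100 is read as n=100,
# and exactly one G among 100 answers essentially never happens, giving 0.0
average_many_sims = np.average([get_prob_simulation(n_iters=10000) for _ in range(0,10)])
round(average_many_sims,3)

The result (about `0.375`, varying in the third decimal place from run to run) is very similar to the exact result from the binomial distribution calculation.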
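How close the simulation gets depends on how many samples are drawn. The sketch below is a standalone variant of the simulation (the function name `simulate` and the fixed seed are assumptions made for this example, not part of the cells above) that prints the estimate and its distance from the exact value for a few iteration counts.

```python
import random

def simulate(n=3, x=1, n_iters=10000, rng=None):
    # Fraction of simulated samples of size n in which exactly x answers are 'G'
    if rng is None:
        rng = random.Random()
    hits = 0
    for _ in range(n_iters):
        sample = rng.choices(['O', 'G'], weights=[0.5, 0.5], k=n)
        if sample.count('G') == x:
            hits += 1
    return hits / n_iters

rng = random.Random(0)  # fixed seed, chosen arbitrarily, for repeatable runs
exact = 0.375           # exact binomial probability for n=3, x=1, p=0.5

for n_iters in (100, 10000, 100000):
    estimate = simulate(n_iters=n_iters, rng=rng)
    print(n_iters, estimate, abs(estimate - exact))
```

The gap to the exact value should generally shrink as `n_iters` grows, although any single run can be lucky or unlucky.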

Suppose that, from a sample of 7 people, 4 prefer `O` and 3 prefer `G`.

* Can we conclude that, in general, more people prefer `O` than `G`?
* What is the probability that someone picked at random would prefer `O`?

In [93]:
n = 7
x = 4
p = 0.5
p_binom(x,n,p)

0.2734375

`0.273` is the probability that exactly 4 of 7 people pick `O` if there is no preference (`p = 0.5`). It is **not** the probability that someone picked at random would prefer `O` over `G`.

Using simulations we get an approximation:

<code>
The probability that, from 7 people, <em>O</em> is picked by 4 if the probabilities of each are the same.
</code>

In [111]:
get_prob_simulation(7,'O',4,50000)

0.27626

The probability of this event or a more extreme one (that 4 or more people prefer `O`) is the sum of the probabilities of these events:

In [133]:
p_binom(4,7,0.5) +\
p_binom(5,7,0.5) +\
p_binom(6,7,0.5) +\
p_binom(7,7,0.5)

0.5
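The same sum can be wrapped in a small helper so that the tail probability is available for any `x`, `n`, and `p`. This is a minimal sketch; `p_at_least` is a hypothetical name, and `p_binom` is re-defined here so the block runs on its own.

```python
import math

def p_binom(x, n, p):
    # Probability of exactly x successes in n trials with success probability p
    ways = math.factorial(n) // (math.factorial(x) * math.factorial(n - x))
    return ways * (p ** x) * ((1 - p) ** (n - x))

def p_at_least(x, n, p):
    # P(X >= x): add up the probabilities of x, x+1, ..., n successes
    return sum(p_binom(k, n, p) for k in range(x, n + 1))

print(p_at_least(4, 7, 0.5))  # 0.5
```

With `p = 0.5` and an odd `n`, "4 or more" and "3 or fewer" are mirror images of each other, which is why the result is exactly half.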

Adding the sum of the opposite events (4 or more people prefer `G`) gives the total probability of the observations.

In this case the total sums to 1.

In [141]:
sum([p_binom(i,7,0.75) for i in range(4,7+1)]) +\
sum([p_binom(i,7,1-0.75) for i in range(4,7+1)])

1.0

In [148]:
sum([round(p_binom(i,7,0.05),3) for i in range(4,7+1)]) +\
sum([round(p_binom(i,7,1-0.05),3) for i in range(4,7+1)])

1.0

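-- Something looks weird about how StatQuest defined these sums to calculate the p-value: here they will always be 1. The reason is that, with 7 people, "4 or more prefer `O`" and "4 or more prefer `G`" (that is, 3 or fewer prefer `O`) are complementary events that cover every possible outcome, so their probabilities add up to 1 for any `p`. A 4 vs 3 split is the least extreme result possible with 7 people, so there is no outcome less extreme than it to leave out of the sum.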
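To see the sums drop below 1, the observation has to be more lopsided than 4 vs 3. The sketch below uses one common convention for a two-sided p-value under no preference (`p = 0.5`): add up the probabilities of every outcome at least as far from `n / 2` as the observed count. This convention is an assumption for illustration and is not necessarily the exact definition used in the video; `two_sided_p` is a hypothetical name.

```python
import math

def p_binom(x, n, p):
    # Probability of exactly x successes in n trials with success probability p
    ways = math.factorial(n) // (math.factorial(x) * math.factorial(n - x))
    return ways * (p ** x) * ((1 - p) ** (n - x))

def two_sided_p(x, n):
    # Assumes no preference (p = 0.5), so the distribution is symmetric around n / 2.
    # Sums the probability of every count k that is at least as far from n / 2 as x.
    observed_distance = abs(x - n / 2)
    return sum(p_binom(k, n, 0.5) for k in range(n + 1)
               if abs(k - n / 2) >= observed_distance)

print(two_sided_p(4, 7))  # 1.0   (4 vs 3 is the least extreme split, so everything counts)
print(two_sided_p(6, 7))  # 0.125 (only 0, 1, 6, or 7 people preferring O count)
```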