<a href="https://colab.research.google.com/github/tomanizer/stats_in_10_minutes/blob/master/Binomial_Distribution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Binomial distribution

The binomial distribution is a *discrete probability distribution*.
Each underlying trial has only two possible outcomes, often described as success/fail, win/lose, yes/no or 0/1.

A single success/fail experiment (such as a single coin toss) is called a *Bernoulli trial*, and a sequence of the outcomes of such trials is called a *Bernoulli process*.

By convention the success-probability is denoted by $p$ and the number of trials as $n$.

A *sampling distribution* describes the probability distribution of a given random-sample-based statistic.

In our case the statistic we are interested in is the number of successes in a number of trials.

A binomial (sampling) distribution describes the probability distribution of the number of successes when taking infinitely many samples of a Bernoulli process, with a given success probability $p$ and $n$ trials in each sample.

A *Bernoulli distribution* is a special case of the *Binomial distribution* where there is only one trial per process, hence $n=1$.
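As a quick check of this special case, the sketch below evaluates the binomial formula (derived later in this notebook) with $n=1$. The helper name `pmf` and the value `p = 0.3` are illustrative assumptions, and `math.comb` requires Python 3.8 or newer.

```python
from math import comb

def pmf(k, n, p):
  # Binomial probability of exactly k successes in n trials
  return comb(n, k) * p**k * (1 - p)**(n - k)

p = 0.3  # illustrative success probability

# With a single trial the only possible outcomes are 0 or 1 success
prob_fail = pmf(0, 1, p)     # equals 1 - p
prob_success = pmf(1, 1, p)  # equals p
print(prob_fail, prob_success)
```

With one trial the binomial coefficient is always 1, so the formula collapses to the two Bernoulli probabilities.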

We implement each element of the *Binomial distribution* step by step in Python.

### Bernoulli Trial

In [0]:
from random import random

def bernoulli_trial(p):
  """A single Bernoulli experiment resulting in an outcome of True or False.

  Parameters:
  p (float): probability of True

  Return:
  bool: success or fail 
  """
  x = random() # uniform random float in the half-open interval [0, 1)
  return x < p # True (success) with probability p, otherwise False (fail)

Perform one *Bernoulli trial* for a coin toss:

In [110]:
p = 0.5

bernoulli_trial(p=p)

True
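A single trial says little about $p$, but repeating it many times should produce successes at a rate close to $p$. The following is a minimal sketch of that check, assuming a seeded `random.Random` instance (the seed, `p = 0.3` and the trial count are arbitrary choices, not from the notebook).

```python
from random import Random

rng = Random(42)  # fixed seed (arbitrary) so the run is repeatable
p = 0.3
trials = 100_000

# Count how often a uniform draw from [0, 1) falls below p
successes = sum(rng.random() < p for _ in range(trials))
frequency = successes / trials
print(frequency)  # should be close to p
```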

### Bernoulli Process

In [0]:
def bernoulli_process(n, p):
  """A series of Bernoulli trials

  Parameters:
  n (int): number of experiments
  p (float): probability of success

  Returns:
  list: series of success and fails
  """
  return [bernoulli_trial(p=p) for i in range(n)]

Create a *Bernoulli process* with 10 trials.

In [164]:
n = 10

bernoulli_process(n=n, p=p)

[True, True, False, False, False, False, True, True, False, True]

### Binomial Sample Distribution


In [0]:
def binomial_sampling(samples, n, p):
  """Draw a number of samples, each a Bernoulli process of n trials with success probability p.

  Parameters:
  samples (int): number of samples
  n (int): number of experiments in each sample
  p (float): success probability

  Returns:
  list of lists: samples of bernoulli processes
  """
  return [bernoulli_process(n=n, p=p) for i in range(samples)]

Take 15 samples of n trials with a success probability p.

In [175]:
samples = 15

simulation1 = binomial_sampling(samples=samples, p=p, n=n)
simulation1

[[True, False, True, True, True, False, True, False, True, True],
 [True, True, True, True, True, False, True, False, True, True],
 [False, False, True, True, True, True, True, True, True, False],
 [True, False, False, False, True, True, False, True, True, True],
 [False, True, True, False, True, False, True, False, False, True],
 [False, True, True, True, False, False, False, True, False, True],
 [True, True, False, False, False, True, True, True, False, True],
 [True, True, True, False, True, True, False, False, False, True],
 [True, True, False, False, False, True, False, False, True, False],
 [False, True, True, True, True, True, False, False, False, True],
 [True, True, False, False, True, True, False, False, False, True],
 [True, True, False, True, True, True, True, False, True, False],
 [False, False, True, False, True, True, True, False, False, False],
 [False, False, False, True, False, True, False, False, True, False],
 [True, True, False, True, False, False, True, True, True

### Histograms of the samples

We plot a summary of the samples as a histogram.

We count the number of successes in each sample and sort the number of successes per sample from low to high.

In [176]:
x = sorted([sum(sim) for sim in simulation1])
x

[3, 4, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8]

(The `histnorm` parameter normalises the counts to values between 0 and 1, so that we can easily compare results from different numbers of samples.)
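The idea behind this normalisation can be sketched by hand, assuming one bin per integer value. This is not a reproduction of what plotly computes internally; the list `x` is copied from the 15-sample output above.

```python
from collections import Counter

# Successes per sample, copied from the 15-sample run above
x = [3, 4, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8]

counts = Counter(x)
# Divide each count by the number of samples to get relative frequencies
relative = {k: counts[k] / len(x) for k in sorted(counts)}
print(relative)
```

Because the relative frequencies always add up to 1, runs with 15, 200 or 1000 samples end up on the same vertical scale.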

In [177]:
import plotly.graph_objects as go
fig = go.Figure()
fig.add_histogram(x=x, name=f"Trials={samples}", histnorm="probability density")

Let's increase the number of samples for a second experiment.

In [0]:
samples2 = 200
simulation2 = binomial_sampling(samples=samples2, n=n, p=p)

In [179]:
x2 = sorted([sum(sim) for sim in simulation2])
fig.add_histogram(x=x2, name=f"Trials={samples2}", histnorm="probability density")

And increase it considerably for a third.

In [0]:
samples3 = 1000
simulation3 = binomial_sampling(samples=samples3, n=n, p=p)

In [181]:
x3 = sorted([sum(sim) for sim in simulation3])
fig.add_histogram(x=x3, name=f"Trials={samples3}", histnorm="probability density")

As we increase the number of samples, our simulated binomial distribution begins to resemble the true binomial sampling distribution.

We obtain the true probabilities (the limit as the number of samples approaches infinity) from the scipy implementation of the binomial distribution and plot the corresponding *probability mass function*.

In [0]:
from scipy.stats import binom

domain = list(range(n))
y_true = binom.pmf(domain, n, p)

In [183]:
fig.add_scatter(x=domain, y=y_true, name="True Binomial")
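The visual comparison can also be expressed as a number. The sketch below, using only the standard library, measures the largest gap between the simulated relative frequencies and the exact probabilities for several sample sizes. The helper names, the seed and the extra sample size of 20000 are assumptions made for illustration.

```python
from collections import Counter
from math import comb
from random import Random

def simulate_successes(rng, samples, n, p):
  # Number of successes in each of `samples` Bernoulli processes of n trials
  return [sum(rng.random() < p for _ in range(n)) for _ in range(samples)]

def max_deviation(successes, n, p):
  # Largest gap between relative frequency and exact pmf over k = 0..n
  counts = Counter(successes)
  total = len(successes)
  return max(
    abs(counts[k] / total - comb(n, k) * p**k * (1 - p)**(n - k))
    for k in range(n + 1)
  )

rng = Random(0)  # arbitrary fixed seed for repeatability
n, p = 10, 0.5
deviations = {}
for samples in (15, 200, 1000, 20000):
  successes = simulate_successes(rng, samples, n, p)
  deviations[samples] = max_deviation(successes, n, p)
  print(samples, round(deviations[samples], 4))
```

The gap generally shrinks as the number of samples grows, although individual runs can fluctuate.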

### Formal description

If a random variable $X$ follows a binomial distribution we write $X \sim B(n,p)$.

That is short for: the random variable $X$ is distributed according to a binomial distribution with $n$ trials and a success probability of $p$.

### Binomial Probability Mass Function

The *probability mass function* returns the probability of getting exactly $k$ successes in $n$ independent Bernoulli trials.

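The values returned above by the scipy implementation `binom.pmf` are listed below. Because `domain` was built with `range(n)`, it covers $k = 0, \dots, 9$ only; the remaining outcome $k = 10$ has probability $0.5^{10} \approx 0.00098$.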

In [184]:
y_true

array([0.00097656, 0.00976563, 0.04394531, 0.1171875 , 0.20507813,
       0.24609375, 0.20507813, 0.1171875 , 0.04394531, 0.00976563])

In [185]:
list(zip(domain, y_true))

[(0, 0.0009765625),
 (1, 0.00976562500000001),
 (2, 0.04394531249999999),
 (3, 0.11718750000000014),
 (4, 0.20507812500000022),
 (5, 0.24609375000000025),
 (6, 0.20507812500000022),
 (7, 0.11718750000000014),
 (8, 0.04394531249999999),
 (9, 0.00976562500000001)]

So the probability of obtaining exactly 3 successes in 10 trials is approximately 0.117.

The pmf can be derived as follows:

- Because the trials are independent, the probability that $k$ given trials all succeed is $p^k$.
- The probability that the remaining $n-k$ trials all fail is $(1-p)^{n-k}$.
- There are ${n\choose k}$ ways to distribute $k$ successes over a series of $n$ trials.

Putting it all together yields the *binomial probability mass function*:

$f(k, n, p) = Pr(X = k) = {n\choose k} p^k(1-p)^{n-k}$
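
As a worked example, plugging $k=3$, $n=10$ and $p=0.5$ into the formula gives:

$f(3, 10, 0.5) = {10\choose 3} \cdot 0.5^3 \cdot 0.5^{7} = 120 \cdot \frac{1}{1024} = 0.1171875$

This agrees with the scipy value for $k=3$ listed above.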

#### Implementing the probability mass function

We can implement the probability mass function ourselves.

In [0]:
from math import factorial

def binomial_coefficient(n, k):
  # integer division keeps the result an exact int (the quotient is always whole)
  return factorial(n) // (factorial(k) * factorial(n-k))

def binomial_pmf(k, n, p):
  return binomial_coefficient(n=n, k=k) * p**k * (1-p)**(n-k)


Verifying the scipy result for 3 successes (k=3) using our own implementation:

In [188]:
binomial_pmf(n=n, k=3, p=p)

0.1171875
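One further sanity check is that the probabilities over all possible values of $k$ add up to 1. The sketch below is a self-contained variant that uses `math.comb` (Python 3.8+) for the binomial coefficient; the name `binomial_pmf_int` is hypothetical and chosen only to avoid clashing with the function above.

```python
from math import comb

def binomial_pmf_int(k, n, p):
  # math.comb returns the binomial coefficient as an exact integer
  return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.5
value = binomial_pmf_int(k=3, n=n, p=p)
# Probabilities over all possible k = 0..n should add up to 1
total = sum(binomial_pmf_int(k=k, n=n, p=p) for k in range(n + 1))
print(value, total)
```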

### The cumulative distribution function for the binomial distribution

The cumulative distribution function returns the probability of *up to k* successes. Hence it is simply summing up the results of the *probability mass function* up to a value of $k$.

$Pr(X \leq k) = \sum\limits_{i=0}^k {n\choose i} p^i(1-p)^{n-i}$



In [0]:
def binomial_cdf(k, n, p):
  return sum(binomial_pmf(k=i, n=n, p=p) for i in range(k+1)) # k+1 because range excludes its endpoint
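
A short usage sketch for the cumulative distribution function follows. It is self-contained, so the helpers `pmf` and `cdf` are redefined here under shorter, illustrative names. It also shows how the complement gives the probability of *at least* a number of successes.

```python
from math import comb

def pmf(k, n, p):
  # Probability of exactly k successes in n trials
  return comb(n, k) * p**k * (1 - p)**(n - k)

def cdf(k, n, p):
  # Sum the pmf over 0..k inclusive
  return sum(pmf(i, n, p) for i in range(k + 1))

n, p = 10, 0.5
at_most_3 = cdf(3, n, p)      # P(X <= 3)
at_least_4 = 1 - at_most_3    # complement: P(X >= 4)
print(at_most_3, at_least_4)  # 0.171875 0.828125
```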

### Properties of the Binomial Distribution

The expected value of $X$ is simply $np$.

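This results from the *linearity of expectation*:
`The expected value of the sum of random variables is equal to the sum of their individual expected values.`
Counting each success as 1 and each failure as 0, $X$ is the sum of $n$ Bernoulli trials that each have expected value $p$, so $E[X] = p + \dots + p = np$.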

In our example with
$p$ = 0.5 and $n$ = 10, $E[X] = np = 5$
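
The same result can be checked directly from the definition of the expected value of a discrete random variable, $E[X] = \sum_k k \cdot Pr(X = k)$. This is a minimal sketch with illustrative helper names, not part of the original notebook.

```python
from math import comb

def pmf(k, n, p):
  # Probability of exactly k successes in n trials
  return comb(n, k) * p**k * (1 - p)**(n - k)

def expected_value(n, p):
  # Definition of the mean of a discrete variable: sum of k * Pr(X = k)
  return sum(k * pmf(k, n, p) for k in range(n + 1))

print(expected_value(10, 0.5))  # 5.0, matching n * p
```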

How does this compare to our sample means?

In [199]:
from statistics import mean
mean(x)

5.733333333333333

In [196]:
mean(x2)

4.975

In [197]:
mean(x3)

4.955

In [198]:
expected = n*p
expected

5.0

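We see that the mean of the samples tends to get closer to the expected value as we take more samples. The approach is not perfectly steady, though: the mean of 1000 samples (4.955) happens to be slightly further from 5 than the mean of 200 samples (4.975), which is ordinary random fluctuation.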
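
To see this tendency over a wider range of sample sizes, the following sketch repeats the experiment using only the standard library. The seed, the helper name `sample_mean` and the extra sample size of 20000 are assumptions for illustration.

```python
from random import Random
from statistics import mean

rng = Random(1)  # arbitrary fixed seed for repeatability
n, p = 10, 0.5

def sample_mean(samples):
  # Mean number of successes over `samples` Bernoulli processes of n trials
  return mean(sum(rng.random() < p for _ in range(n)) for _ in range(samples))

# Absolute distance between each sample mean and the expected value n * p
errors = {s: abs(sample_mean(s) - n * p) for s in (15, 200, 1000, 20000)}
print(errors)
```

The error for the largest run is typically much smaller than for the smallest one, even though neighbouring sample sizes can swap places by chance.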