# Examples for DS Lecture, Week 3
*Phong Nguyen, 10 Oct 2018*

## Binomial distribution

#### Task: Draw samples from a Binomial distribution and visualise them
- a straightforward way: use a library such as [numpy](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.binomial.html)
- a fun way: simulate the data by hand

In [None]:
n = 100
p = 0.2

First, simulate a biased coin that lands heads with probability $p$.

In [None]:
from random import random

def flip_coin(p):
    'Return H with probability p, otherwise T.'
    # random() returns a float in [0.0, 1.0), so a strict < gives exactly p.
    return 'H' if random() < p else 'T'

In [None]:
flip_coin(p)

Test by flipping the coin 100 times.

In [None]:
from collections import Counter

coins = [flip_coin(p) for _ in range(100)]
print(coins)
print(Counter(coins))

Flip the coin $n$ times and count the heads to simulate one binomially distributed value.

In [None]:
def run_experiment(n, p):
    'Flip the coin n times and return the number of heads.'
    coins = [flip_coin(p) for _ in range(n)]
    return Counter(coins)['H']

In [None]:
run_experiment(n, p)

The result of each experiment is a sample drawn from a Binomial distribution $B(n,p)$.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

def draw_barchart(data):
    'Draw a bar chart of how often each value occurs in data.'
    counts = Counter(data)
    plt.bar(counts.keys(), counts.values(), width=0.9)
    plt.show()

In [None]:
sample_size = 1000
data = [run_experiment(n, p) for _ in range(sample_size)] 
draw_barchart(data)

Double-check the result against the `numpy` library.

In [None]:
import numpy as np
data = np.random.binomial(n=n, p=p, size=sample_size)
draw_barchart(data)
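
As a numeric sanity check alongside the bar charts, the sample mean and variance can be compared with the theoretical values $np$ and $np(1-p)$ of $B(n,p)$. The sketch below is self-contained and uses only the standard library; the seeded generator `rng` and the names `sample_mean` and `sample_var` are assumptions made for this illustration, not part of the lecture code.

```python
import random
from statistics import mean, pvariance

n, p = 100, 0.2
sample_size = 1000
rng = random.Random(0)  # fixed seed (an assumption) so the run is repeatable

def run_experiment(n, p):
    'Flip a biased coin n times and return the number of heads.'
    return sum(rng.random() < p for _ in range(n))

data = [run_experiment(n, p) for _ in range(sample_size)]

sample_mean = mean(data)
sample_var = pvariance(data)

# Theoretical values for B(n, p): mean n*p, variance n*p*(1-p).
print('mean:    ', sample_mean, 'expected', n * p)
print('variance:', sample_var, 'expected', n * p * (1 - p))
```

With 1000 experiments, both numbers should land close to 20 and 16, though not exactly on them.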

## Q-Q Plot

#### Task: Draw a Q-Q plot

First, generate 100 random numbers following the standard normal distribution.

In [None]:
data = np.random.normal(0, 1, 100)
data

In [None]:
sorted_data = sorted(data)
sorted_data

Next, we need the theoretical quantiles of the standard normal distribution. If the 100 numbers really follow a normal distribution, they should spread across the bell curve in proportion to its area. So we split the distribution into 100 segments of equal probability and find the z-score at the middle of each segment.

In [None]:
sampling_points = (np.arange(100) + 0.5) / 100
sampling_points

We use scipy to compute the z-scores: `st.norm.ppf` is the inverse of the normal CDF.

In [None]:
import scipy.stats as st
# Use q as the loop variable so it is not confused with the coin's p above.
sampling_theoretical_quantiles = [st.norm.ppf(q) for q in sampling_points]
sampling_theoretical_quantiles
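
To make the role of `st.norm.ppf` concrete, here is a small self-contained sketch. It assumes the same 100 midpoints as above and checks two properties: applying `st.norm.cdf` to the quantiles returns the original probabilities, and the quantiles are symmetric around 0. The names `quantiles` and `round_trip` are introduced only for this illustration.

```python
import numpy as np
import scipy.stats as st

# Midpoints of 100 equal-probability segments: 0.005, 0.015, ..., 0.995.
sampling_points = (np.arange(100) + 0.5) / 100

# ppf accepts a whole array, so no explicit loop is needed.
quantiles = st.norm.ppf(sampling_points)

# cdf undoes ppf, so we should get the probabilities back.
round_trip = st.norm.cdf(quantiles)
print(np.allclose(round_trip, sampling_points))

# The standard normal is symmetric, so the quantiles mirror around 0.
print(np.allclose(quantiles, -quantiles[::-1]))

# The extreme quantiles show the range the Q-Q plot will cover.
print(quantiles[0], quantiles[-1])
```

The smallest and largest quantiles come out at roughly -2.58 and 2.58, which is why a reference line from -2.5 to 2.5 covers almost all of the points.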

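Now plot the sorted data against the theoretical quantiles in a scatter plot. The red line is $y = x$, where the points would fall if the sample matched the distribution exactly.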

In [None]:
plt.scatter(sampling_theoretical_quantiles, sorted_data)
plt.plot([-2.5, 2.5], [-2.5, 2.5], 'r')
plt.show()
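
One rough way to put a number on "the points follow the line" is the correlation coefficient between the theoretical quantiles and the sorted data. The sketch below is self-contained and regenerates a sample with a seeded generator, which is an assumption for repeatability; the names `theoretical` and `r` are also illustrative.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)  # fixed seed (an assumption) for repeatability
data = rng.normal(0, 1, 100)
sorted_data = np.sort(data)

# Same theoretical quantiles as in the Q-Q plot above.
theoretical = st.norm.ppf((np.arange(100) + 0.5) / 100)

# Pearson correlation between the two axes of the Q-Q plot.
r = np.corrcoef(theoretical, sorted_data)[0, 1]
print(r)
```

For normally distributed data, `r` should be very close to 1; a noticeably smaller value suggests the points bend away from a straight line.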

Again, double-check with a library. This time, it is [statsmodels](https://www.statsmodels.org/dev/generated/statsmodels.graphics.gofplots.qqplot.html).

In [None]:
import statsmodels.api as sm

# line='r' adds a regression line fitted to the points.
sm.qqplot(data, line='r')
plt.show()

As a bonus third way, we plot our data against another empirical sample drawn from a normal distribution with very different parameters. The points still follow a straight line!

In [None]:
another_normal_data = np.random.normal(10, 101, 100)
plt.scatter(sorted(another_normal_data), sorted_data)
plt.show()
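
The straight line appears because a normal variable with mean $\mu$ and standard deviation $\sigma$ can be written as $\mu + \sigma Z$ with $Z$ standard normal, so its quantiles are a linear function of the standard normal quantiles. As a minimal sketch of this, the code below fits a line to the Q-Q points of a sample from $N(10, 101^2)$ plotted against the standard normal quantiles; the slope and intercept should roughly recover $\sigma$ and $\mu$. The larger sample size and the fixed seed are assumptions chosen to reduce noise, and the variable names are illustrative.

```python
import numpy as np
import scipy.stats as st

mu, sigma = 10, 101
size = 10_000  # larger than the 100 used above (an assumption, to reduce noise)

rng = np.random.default_rng(0)  # fixed seed (an assumption) for repeatability
sample = np.sort(rng.normal(mu, sigma, size))

# Standard normal quantiles at the midpoints of equal-probability segments.
z = st.norm.ppf((np.arange(size) + 0.5) / size)

# Fit sample = slope * z + intercept; polyfit returns the highest degree first.
slope, intercept = np.polyfit(z, sample, 1)
print('slope:    ', slope, '(sigma is', sigma, ')')
print('intercept:', intercept, '(mu is', mu, ')')
```

The same reasoning explains the plot of two empirical normal samples against each other: both sets of quantiles are linear in the standard normal quantiles, so they are linear in each other.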