# Random sampling and Sampling distributions

## Random sampling 

* Random sampling is the selection of $n$ observations from a population of size $N$
* The population has its own parameters, such as the mean $\mu$ and the standard deviation $\sigma$. 
* Statistics of the sample
    * Sample mean $\bar{x}$ $$\bar{x} = \frac{\sum x_i}{n} $$
    * Sample Variance $s^2$ 
    
        * Formula $$s^2 = \frac{1}{n-1} \sum(x_i-\bar{x})^2$$ 
        * Computationally efficient formula $$s^2 = \frac{1}{n-1} \Big (\sum x_i^2- \frac{\big(\sum x_i\big)^2}{n}\Big)$$
    
    * Sample median 
        1. Sort the values 
        2. Select the middle value (if $n$ is even, the median is the average of the two middle values)
    * Sample Range $$max - min$$
    * Sample mode: The most frequent value(s)
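
The statistics listed above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the data set and the function names (`sample_variance`, `sample_variance_shortcut`, `sample_modes`) are made up for the example. It also checks that the two variance formulas give the same result.

```python
import math
import statistics
from collections import Counter

def sample_variance(xs):
    """Definitional formula: squared deviations from the mean over n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    """Computational formula: needs only sum(x) and sum(x^2)."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)

def sample_modes(xs):
    """All values that share the highest frequency."""
    counts = Counter(xs)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

data = [4, 7, 7, 9, 10, 12]  # hypothetical sample, n = 6

xbar = sum(data) / len(data)
s2 = sample_variance(data)
s2_short = sample_variance_shortcut(data)
median = statistics.median(data)
sample_range = max(data) - min(data)
modes = sample_modes(data)

print(f"mean     = {xbar:.3f}")
print(f"variance = {s2:.3f} (shortcut formula: {s2_short:.3f})")
print(f"std dev  = {math.sqrt(s2):.3f}")
print(f"median   = {median}, range = {sample_range}, mode(s) = {modes}")
```

The shortcut formula needs only one pass over the data, but in floating-point arithmetic it can lose precision when the mean is large compared with the spread, so the definitional formula is the safer default in code.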


### Examples

#### Example 1

The January 1990 issue of Arizona Trend contains a
supplement describing the 12 “best” golf courses in the state.
The yardages (lengths) of these courses are as follows: 6981,
7099, 6930, 6992, 7518, 7100, 6935, 7518, 7013, 6800, 7041,
and 6890. Calculate the sample mean and sample standard
deviation, sample median, sample range and the mode. 
Construct a dot diagram of the data.

**Solution**

|$$x$$ | $$x-\bar{x}$$|$$(x-\bar{x})^2$$| 
|--|---|---|
|6981|-87.08  |7583.51  |
|7099|30.92  |955.84  |
|6930|-138.08  |19067.01  |
|6992|-76.08  |5788.67  |
|7518|449.92  |202425.01  |
|7100|31.92  |1018.67  |
|6935|-133.08  |17711.17  |
|7518|449.92  |202425.01  |
|7013|-55.08  |3034.17  |
|6800|-268.08  |71868.67  |
|7041|-27.08  |733.51  |
|6890|-178.08  |31713.67  |
|$$\bar{x} = \frac{\sum x}{n} = 7068.08$$| |$$s = \sqrt{\frac{\sum(x-\bar{x})^2}{n-1}} = \sqrt{\frac{564324.92}{11}} = 226.50$$|

So

* the mean

$$\bar{x} = 7068.08$$ 

* the standard deviation (with the $n-1$ denominator)

$$s = 226.50$$


* The median

6800, 6890, 6930, 6935, 6981, **6992, 7013**, 7041, 7099, 7100, 7518, 7518

$$median = \frac{6992 + 7013}{2} = 7002.5$$

* The range 
$$range = 7518 - 6800 = 718$$

* The mode

$$mode = 7518$$

* The dot graph

![](images/dot_graph.png)

As the dot diagram shows, the two 7518-yard courses pull the mean (7068.08) above the median (7002.5): the mean is influenced by outliers, while the median is robust to them.
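
One way to see this effect numerically is to recompute both statistics after dropping the two 7518-yard courses. The sketch below uses only the yardages from Example 1; treating those two values as the outliers is a choice made for illustration.

```python
import statistics

yardages = [6981, 7099, 6930, 6992, 7518, 7100,
            6935, 7518, 7013, 6800, 7041, 6890]

# Drop the two 7518-yard courses, which sit far to the right of the rest.
trimmed = [y for y in yardages if y != 7518]

mean_all, median_all = statistics.mean(yardages), statistics.median(yardages)
mean_trim, median_trim = statistics.mean(trimmed), statistics.median(trimmed)

print(f"all 12 courses : mean = {mean_all:.2f}, median = {median_all:.1f}")
print(f"without 7518s  : mean = {mean_trim:.2f}, median = {median_trim:.1f}")
print(f"shift in mean   = {mean_all - mean_trim:.2f}")
print(f"shift in median = {median_all - median_trim:.1f}")
```

The mean moves by roughly 90 yards while the median moves by 16, which is the sense in which the median is the more robust summary.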

* How to use a calculator to get the mean and standard deviation: see [this video](https://www.youtube.com/watch?v=AD_e7qW_Qq0)

In [1]:
from IPython.display import IFrame
IFrame(width=560, height=315, src="https://www.youtube.com/embed/AD_e7qW_Qq0")

#### Example 2

An article in the Journal of Structural Engineering
(Vol. 115, 1989) describes an experiment to test the yield strength
of circular tubes with caps welded to the ends. The first yields (in
kN) are 96, 96, 102, 102, 102, 104, 104, 108, 126, 126, 128, 128,
140, 156, 160, 160, 164, and 170. Calculate the sample mean and sample standard
deviation, sample median, sample range and the mode. 
Construct a dot diagram of the data.

**Solve it yourself**

## Sampling distributions

### Distribution of sample mean $\bar{x}$

* When the sampling process is repeated, the sample mean $\bar{x}$ changes from sample to sample
* So it is a random variable and has its own distribution
* **Central limit theorem**

    It states that, for a sufficiently large sample size $n$, the distribution of the sample mean $\bar{x}$ is approximately normal with mean (mean of sample means) 
    $$\mu_{\bar{x}} = \mu$$ 
    and standard deviation $$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$ 
    
    where $\mu$ and $\sigma$ are the parameters of the population. If the population itself is normal, $\bar{x}$ is exactly normal for any $n$.
    
    We can transform it to the standard normal distribution: 
    $$Z = \frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{n}}}$$
    

* A good visualization of the central limit theorem is available [here](https://seeing-theory.brown.edu/probability-distributions/index.html).

In [2]:
IFrame(width=1024, height=512, src="https://seeing-theory.brown.edu/probability-distributions/index.html")
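
The statement about the mean and standard deviation of $\bar{x}$ can also be checked by simulation. The sketch below assumes an exponential population (chosen only because it is clearly not normal), draws many samples of size $n$, and compares the mean and standard deviation of the sample means with $\mu$ and $\frac{\sigma}{\sqrt{n}}$. The population, sample size, and number of repetitions are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Assumed population: exponential with mean 2, so mu = 2 and sigma = 2.
mu, sigma = 2.0, 2.0
n = 40              # size of each sample
repeats = 20_000    # number of samples drawn

# Each row is one sample of size n; each row mean is one value of xbar.
samples = rng.exponential(scale=mu, size=(repeats, n))
sample_means = samples.mean(axis=1)

print(f"mean of sample means = {sample_means.mean():.3f}  (mu = {mu})")
print(f"std of sample means  = {sample_means.std(ddof=1):.3f}  "
      f"(sigma/sqrt(n) = {sigma / np.sqrt(n):.3f})")
```

A histogram of `sample_means` should look close to a bell curve even though the population is strongly skewed.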

### Examples

#### Example 3

A synthetic fiber used in manufacturing carpet has
tensile strength that is normally distributed with mean 75.5 psi
and standard deviation 3.5 psi. Find the probability that a random sample of n = 49 fiber specimens will have sample mean
tensile strength that exceeds 75.75 psi.

**Solution**

Given $n=49$, $\mu = 75.5$, and $\sigma = 3.5$

Required: $P(\bar{x} > 75.75)$

Because the population is normal, $\bar{x}$ is normally distributed with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}}$

$P(\bar{x} > 75.75) = P\Big(Z > \frac{75.75 - \mu}{\frac{\sigma}{\sqrt{n}}}\Big)$

$P\Big(Z > \frac{75.75 - 75.5}{\frac{3.5}{\sqrt{49}}}\Big) =  P(Z > 0.5) = P(Z < -0.5) = 0.3085$ 

from $Z$ table
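
The table lookup can be cross-checked with `scipy.stats`. The sketch below follows the same steps as the hand solution: compute the standard deviation of $\bar{x}$, standardize, then take the upper-tail probability with `st.norm.sf`.

```python
import math
import scipy.stats as st

mu, sigma, n = 75.5, 3.5, 49
threshold = 75.75

standard_error = sigma / math.sqrt(n)     # standard deviation of xbar
z = (threshold - mu) / standard_error     # standardized threshold
p_exceeds = st.norm.sf(z)                 # upper tail, same as 1 - cdf(z)

print(f"z = {z:.2f}, P(xbar > {threshold}) = {p_exceeds:.4f}")
```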

#### Example 4

The compressive strength of concrete is normally distributed with $\mu=2500$ psi and $\sigma = 50$ psi. Find the probability that a random sample of $n = 5$ specimens will have a sample
mean compressive strength that falls in the interval from 2499 psi to 2510 psi.

**Solution**

Given $n=5$, $\mu = 2500$, and $\sigma = 50$

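Required: $P(2499 < \bar{x} < 2510)$

Because the population is normal, $\bar{x}$ is normally distributed with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}} = \frac{50}{\sqrt{5}} = 22.36$

$P(2499 < \bar{x} < 2510) = P\Big(\frac{2499 - 2500}{22.36} < Z < \frac{2510 - 2500}{22.36}\Big)$

$= P(-0.0447 < Z  < 0.447) = 0.672 - 0.482  = 0.19$

from $Z$ table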
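
An interval probability like this one can also be computed without standardizing by hand, by passing the mean and standard deviation of $\bar{x}$ to `st.norm` as `loc` and `scale`. The helper name `prob_mean_between` is made up for this sketch.

```python
import math
import scipy.stats as st

def prob_mean_between(low, high, mu, sigma, n):
    """P(low < xbar < high) when xbar ~ Normal(mu, sigma / sqrt(n))."""
    xbar_dist = st.norm(loc=mu, scale=sigma / math.sqrt(n))
    return xbar_dist.cdf(high) - xbar_dist.cdf(low)

p = prob_mean_between(2499, 2510, mu=2500, sigma=50, n=5)
print(f"P(2499 < xbar < 2510) = {p:.3f}")
```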

### Python helper function

In [7]:
# To get the values from z table 
import scipy.stats as st

# 1. Given z we can get the probability using st.norm.cdf(z)
# Example
z = -0.045
p = st.norm.cdf(z)
print("p = %.3f"%p)

# 2. Given p we can get the z value using st.norm.ppf(p)
p = 0.482
z = st.norm.ppf(p)
print("z = %.3f"%z)

p = 0.482
z = -0.045


# Confidence Intervals

In the previous part we learned how to compute the probability of observing a given range of values of the sample mean $\bar{x}$, given the population mean $\mu$ and standard deviation $\sigma$, using the central limit theorem. 

In real life, the parameters of the population ($\mu$, $\sigma$) are usually unknown because it is almost impossible to observe the whole population. So it makes sense to take a sample and use the sample statistics to estimate the population parameters with a stated degree of confidence (probability).

According to the central limit theorem: 

$$P\Big(-Z_{\frac{\alpha}{2}}<\frac{\bar{x}-\mu}{\frac{\sigma}{\sqrt{n}}} < Z_{\frac{\alpha}{2}}\Big) = 1-\alpha$$

where $1-\alpha$ is the confidence level, written $CI$ in these notes, and $Z_{\frac{\alpha}{2}}$ is the value that leaves probability $\frac{\alpha}{2}$ in the upper tail of the standard normal distribution.

Rearranging the equation we have 

$$\bar{x}-Z_{\frac{\alpha}{2}} \times \frac{\sigma}{\sqrt{n}} < \mu  < \bar{x}+ Z_{\frac{\alpha}{2}}  \times   \frac{\sigma}{\sqrt{n}} $$

In other words, we can estimate a range for the population mean $\mu$ from the sample mean $\bar{x}$ and state our confidence in that range as the probability $CI$

$$\mu = \bar{x} \pm E$$

where $E$ is the estimation error (margin of error) and 

$$E = Z_{\frac{\alpha}{2}} \times \frac{\sigma}{\sqrt{n}}$$

The interval length (upper limit minus lower limit) is $2E$.

Solving for $n$ gives the sample size needed to achieve a given error $E$ (round up to the next integer): 

$$n = \Big(\frac{Z_{\frac{\alpha}{2}} \times \sigma}{E}\Big)^2$$
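
The formulas for $E$ and $n$ translate directly into code. The sketch below is one possible arrangement: the function names and the numbers ($\sigma = 4$, $n = 25$, $\bar{x} = 50$) are hypothetical and chosen only to exercise the formulas.

```python
import math
import scipy.stats as st

def z_critical(confidence):
    """Z_{alpha/2} for a two-sided interval at the given confidence level."""
    alpha = 1 - confidence
    return st.norm.ppf(1 - alpha / 2)

def margin_of_error(sigma, n, confidence):
    """E = Z_{alpha/2} * sigma / sqrt(n)."""
    return z_critical(confidence) * sigma / math.sqrt(n)

def required_sample_size(sigma, max_error, confidence):
    """Smallest integer n for which E <= max_error."""
    return math.ceil((z_critical(confidence) * sigma / max_error) ** 2)

# Hypothetical numbers: sigma = 4 is known, a sample of 25 has mean 50.
xbar, sigma, n = 50.0, 4.0, 25
E = margin_of_error(sigma, n, 0.95)
n_needed = required_sample_size(sigma, 1.0, 0.95)

print(f"95% interval: {xbar - E:.3f} < mu < {xbar + E:.3f}")
print("n for E <= 1:", n_needed)
```

Rounding $n$ up rather than to the nearest integer keeps the error at or below the target.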

### Student T table

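* The t distribution resembles the standard normal distribution but has heavier tails; it approaches the standard normal as $n$ grows
* We use it when the $\sigma$ of the population is unknown and is estimated by the sample standard deviation $s$, which matters most when the sample size $n$ is small (the population is assumed to be approximately normal)
* The value $t_{\frac{\alpha}{2}, n-1}$ takes the place of $Z_{\frac{\alpha}{2}}$, with $n-1$ degrees of freedom
* Therefore the range of $\mu$ is 

$$\mu = \bar{x} \pm \Big( t_{\frac{\alpha}{2}, n-1} \times \frac{s}{\sqrt{n}}\Big)$$

where $s$ is the standard deviation of the sample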

* A good visualization of the difference between t and Z distributions [here](https://rpsychologist.com/d3/tdist/)

In [4]:
IFrame(width=1024, height=512, src="https://rpsychologist.com/d3/tdist/")
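
The relation between the t and $Z$ critical values can be tabulated with `scipy.stats`. The sketch below fixes the confidence level at 95% and uses an arbitrary list of sample sizes to show the t value shrinking toward the $Z$ value as $n$ grows.

```python
import scipy.stats as st

confidence = 0.95
tail = 1 - (1 - confidence) / 2      # cumulative probability 0.975

z_crit = st.norm.ppf(tail)
print(f"z critical value: {z_crit:.3f}")

# t critical value with n - 1 degrees of freedom for several sample sizes
t_values = {n: st.t.ppf(tail, df=n - 1) for n in (3, 5, 10, 16, 30, 100, 1000)}

for n, t_crit in t_values.items():
    print(f"n = {n:4d}  t = {t_crit:.3f}  t - z = {t_crit - z_crit:.3f}")
```

For small $n$ the t interval is noticeably wider than the $Z$ interval; for large $n$ the two are nearly identical.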

### A large sample confidence interval for a population proportion 

* Normal approximation of a binomial random variable $x$ (the number of successes in $n$ trials, each with success probability $P$)

$$Z = \frac{x-nP}{\sqrt{nP(1-P)}}$$

* The sample proportion $\hat{P} = \frac{x}{n}$
* The interval for the population proportion $P$ at confidence level $CI = 1-\alpha$: 

$$\hat{P}-Z_{\frac{\alpha}{2}} \times \sqrt{\frac{\hat{P}(1-\hat{P})}{n}} < P  < \hat{P}+Z_{\frac{\alpha}{2}} \times \sqrt{\frac{\hat{P}(1-\hat{P})}{n}}$$

In other words 

$$P = \hat{P} \pm E$$ 

and 

$$E = Z_{\frac{\alpha}{2}} \times \sqrt{\frac{\hat{P}(1-\hat{P})}{n}}$$


We can also get the sample size using 

$$n = \Big(\frac{Z_{\frac{\alpha}{2}}}{E}\Big)^2 \times  \hat{P}(1-\hat{P})$$

Note:

This large-sample interval uses only the $Z$ table, not the t table. 
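
The interval and sample-size formulas for a proportion can be sketched as two small functions. The names `proportion_interval` and `proportion_sample_size` and the survey numbers (120 successes in 400 trials) are hypothetical. The normal approximation is usually considered reasonable only when both $n\hat{P}$ and $n(1-\hat{P})$ are not small.

```python
import math
import scipy.stats as st

def proportion_interval(x, n, confidence=0.95):
    """Large-sample (normal approximation) interval for a proportion."""
    p_hat = x / n
    z = st.norm.ppf(1 - (1 - confidence) / 2)
    E = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - E, p_hat + E

def proportion_sample_size(p_hat, max_error, confidence=0.95):
    """Smallest integer n whose margin of error is at most max_error."""
    z = st.norm.ppf(1 - (1 - confidence) / 2)
    return math.ceil((z / max_error) ** 2 * p_hat * (1 - p_hat))

# Hypothetical survey: 120 successes in 400 trials.
low, high = proportion_interval(120, 400)
n_needed = proportion_sample_size(0.30, 0.02)

print(f"{low:.4f} < P < {high:.4f}")
print("n for E <= 0.02:", n_needed)
```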

### Examples

#### Example 5
A civil engineer is analyzing the compressive
strength of concrete. Compressive strength is normally distributed with $\sigma^2 = 1000\ (\text{psi})^2$. A random sample of 12 specimens
has a mean compressive strength of $\bar{x} = 3250$ psi.

(a) Construct a 95% confidence interval on mean
compressive strength.

(b) Construct a 99% confidence interval on mean compressive strength. Compare the width of this confidence interval with the width of the one found in part (a).

**Solution**

Population sigma is known (Z table) 

(a)

given $n = 12$, $\sigma=\sqrt{1000}$, $\bar{x} = 3250$, $CI = 0.95$ so $\alpha=0.05$ and $\frac{\alpha}{2} = 0.025$

So 

$$P\Big(-Z_{\frac{\alpha}{2}}<\frac{3250-\mu}{\frac{\sqrt{1000}}{\sqrt{12}}} < Z_{\frac{\alpha}{2}}\Big) = 0.95$$

From $Z$ table $Z_{\frac{\alpha}{2}}=1.96$

So: 

$$-1.96<\frac{3250-\mu}{\frac{\sqrt{1000}}{\sqrt{12}}} < 1.96$$

and 


$$3250-1.96\times \frac{\sqrt{1000}}{\sqrt{12}} < \mu < 3250+1.96\times \frac{\sqrt{1000}}{\sqrt{12}} $$

So 
$$3232.108 < \mu < 3267.89$$

(b)

$CI = 0.99$ so $\frac{\alpha}{2}=0.005$ and $Z_{\frac{\alpha}{2}}=2.575$

$$3250-2.575\times \frac{\sqrt{1000}}{\sqrt{12}} < \mu < 3250+2.575\times \frac{\sqrt{1000}}{\sqrt{12}} $$

So 

$$3226.49 < \mu < 3273.51$$

The width in part (b), 47.01, is greater than the width in part (a), 35.78: a higher confidence level requires a wider interval. 

#### Example 6

Dairy cows at large commercial farms often receive
injections of bST (Bovine Somatotropin), a hormone used to
spur milk production. Bauman et al. (Journal of Dairy Science,
1989) reported that 12 cows given bST produced an average of
28.0 kg/d of milk. Assume that the standard deviation of milk
production is 2.25 kg/d.

(a) Find a 99% confidence interval for the true mean milk
production.

(b) If the farms want the confidence interval to be no wider than
±1.25 kg/d, what level of confidence would they need to use?

**Solution**

Population sigma is known (Z table) 

(a)

given $n = 12$, $\sigma=2.25$, $\bar{x} = 28$, $CI = 0.99$ so  $\frac{\alpha}{2} = 0.005$ and $Z_{\frac{\alpha}{2}}=2.575$

So 

$$28-2.575\times \frac{2.25}{\sqrt{12}} < \mu < 28+2.575\times \frac{2.25}{\sqrt{12}} $$

So 

$$26.33 < \mu < 29.67$$

(b)

The half-width of the interval must satisfy $E \le 1.25$:

$$Z_{\frac{\alpha}{2}} \times \frac{2.25}{\sqrt{12}} \le 1.25$$

and so:

$$Z_{\frac{\alpha}{2}} \le 1.25 \times \frac{\sqrt{12}}{2.25}$$
$$Z_{\frac{\alpha}{2}} \le 1.924$$

and so 

$$CI \le P(-1.924< Z < 1.924) = 0.973 - 0.027 = 0.946$$

So the confidence level must be at most 94.6%.
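
Going from a required half-width back to a confidence level is the reverse of the usual lookup, and it can be sketched as follows. The function name `confidence_for_margin` is made up for this illustration; the inputs are those of part (b).

```python
import math
import scipy.stats as st

def confidence_for_margin(sigma, n, max_error):
    """Confidence level whose z interval has half-width exactly max_error."""
    z = max_error * math.sqrt(n) / sigma      # largest allowed Z_{alpha/2}
    return st.norm.cdf(z) - st.norm.cdf(-z)   # P(-z < Z < z)

level = confidence_for_margin(sigma=2.25, n=12, max_error=1.25)
print(f"confidence level = {level:.3f}")
```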

#### Example 7

A confidence interval estimate is desired for the gain
in a circuit on a semiconductor device. Assume that gain is normally distributed with standard deviation $\sigma = 20$.

(a) Find a 95% CI for $\mu$ when $n = 10$ and $\bar{x} = 1000$ .

(b) Find a 95% CI for $\mu$ when $n = 25$ and $\bar{x} = 1000$ .

(c) How does the length of the CIs computed change with the
changes in sample size?

(d) How large must n be if the length of the 95% CI is to be 40?

**Solution**

Population sigma is known (Z table) 

(a)

given $n = 10$, $\sigma=20$, $\bar{x} = 1000$, $CI = 0.95$ so  $\frac{\alpha}{2} = 0.025$ and $Z_{\frac{\alpha}{2}}=1.96$

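So 

$$1000-1.96\times \frac{20}{\sqrt{10}} < \mu < 1000+1.96\times \frac{20}{\sqrt{10}} $$

$$987.6 < \mu < 1012.4$$

(b) $n = 25$

$$1000-1.96\times \frac{20}{\sqrt{25}} < \mu < 1000+1.96\times \frac{20}{\sqrt{25}} $$

So 

$$992.2 < \mu < 1007.8$$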

(c)

length of (a) = $1012.4-987.6= 24.8$

length of (b) = $1007.8-992.2= 15.6$

The length decreases as the sample size increases.

(d)

$$\text{Length} = 2\times1.96 \times \frac{20}{\sqrt{n}} \le 40$$

So 

$\sqrt{n} \ge 1.96$, which gives $n \ge 3.84$, so the smallest sample size is $n = 4$ 

#### Example 8

A research engineer for a tire manufacturer is investigating tire life for a new rubber compound and has built 16 tires and tested them to end-of-life in a road test. The sample mean and
standard deviation are 60139.7 and 3645.94 kilometers. Find a
95% confidence interval on mean tire life.

**Solution**

Population sigma unknown (t-table)

Given $n = 16$, $\bar{x} = 60139.7$, $s = 3645.94$ and $CI = 0.95$

from t table $t_{\frac{\alpha}{2}, n-1} \rightarrow t_{0.025, 15} = 2.131$

$$ \bar{x} - \Big( t_{\frac{\alpha}{2}, n-1} \times \frac{s}{\sqrt{n}}\Big) < \mu < \bar{x} + \Big( t_{\frac{\alpha}{2}, n-1} \times \frac{s}{\sqrt{n}}\Big)$$

$$ 60139.7 - \Big( 2.131 \times \frac{3645.94}{\sqrt{16}}\Big) < \mu <  60139.7 + \Big( 2.131 \times \frac{3645.94}{\sqrt{16}}\Big)$$

So

$$ 58197.3 < \mu <  62082.1$$

#### Example 9

The fraction of defective integrated circuits produced
in a photolithography process is being studied. A random sample
of 300 circuits is tested, revealing 13 defectives. Calculate a 95% CI on the fraction of defective
circuits produced by this particular tool.

**Solution**

given $n = 300$, defectives = 13, $CI = 0.95$ so  $\frac{\alpha}{2} = 0.025$ and $Z_{\frac{\alpha}{2}}=1.96$

$\hat{P} = \frac{13}{300}$

The proportion in the population is 

$$\hat{P}-Z_{\frac{\alpha}{2}} \times \sqrt{\frac{\hat{P}(1-\hat{P})}{n}} < P  < \hat{P}+Z_{\frac{\alpha}{2}} \times \sqrt{\frac{\hat{P}(1-\hat{P})}{n}}$$

So 

$$\frac{13}{300}- 1.96 \times \sqrt{\frac{\frac{13}{300}(1-\frac{13}{300})}{300}} < P  < \frac{13}{300} + 1.96 \times \sqrt{\frac{\frac{13}{300}(1-\frac{13}{300})}{300}}$$

and 

$$0.0203< P < 0.0664$$

#### Example 10

Of 1000 randomly selected cases of lung cancer, 823 resulted in death within 10 years.

(a) Calculate a 95% confidence interval on the death rate from lung cancer.

(b) Using the point estimate of p obtained from the preliminary
sample, what sample size is needed to be 95% confident that
the error in estimating the true value of p is less than 0.03?


**Solution**

(a)

given $n = 1000$, deaths = 823, $CI = 0.95$ so  $\frac{\alpha}{2} = 0.025$ and $Z_{\frac{\alpha}{2}}=1.96$

$\hat{P} = 0.823$

The proportion in the population is 

$$\hat{P}-Z_{\frac{\alpha}{2}} \times \sqrt{\frac{\hat{P}(1-\hat{P})}{n}} < P  < \hat{P}+Z_{\frac{\alpha}{2}} \times \sqrt{\frac{\hat{P}(1-\hat{P})}{n}}$$

So 

$$0.823- 1.96 \times \sqrt{\frac{0.823(1-0.823)}{1000}} < P  < 0.823 + 1.96 \times \sqrt{\frac{0.823(1-0.823)}{1000}}$$

and 

$$0.799 < P < 0.847$$

(b) $E < 0.03$

$$n = \Big(\frac{Z_{\frac{\alpha}{2}}}{E}\Big)^2 \times  \hat{P}(1-\hat{P})$$

$$n = \Big(\frac{1.96}{0.03}\Big)^2 \times  0.823(1-0.823)$$

$n = 621.8$, so $n = 622$ or greater

### Statistical intervals in Python

In [8]:
import scipy.stats as st
import numpy as np

# 1.For normal distribution use st.norm.interval(CI, xbar, sigma/sqrt(n))
# Example 5
# given n = 12, sigma=1000^0.5, xbar = 3250, CI = 0.95
# answer 3232.108 < mu < 3267.89

xbar = 3250
sigma = 1000**0.5
n = 12
CI = 0.95
mu_range = st.norm.interval(CI,
                            xbar,
                            sigma/np.sqrt(n))
print("%.2f"%mu_range[0], "< mu < ", "%.2f"%mu_range[1]) 

3232.11 < mu <  3267.89


In [9]:
# 2.For t distribution use st.t.interval(CI, n-1, xbar, s/sqrt(n))
# Example 8
# Given n = 16, xbar = 60139.7, s = 3645.94 and CI = 0.95
# answer  58197.3 < mu <  62082.1 (hand calculation with the rounded table value t = 2.131)
xbar = 60139.7
s = 3645.94
n = 16
CI = 0.95
mu_range = st.t.interval(CI,
                         n-1,
                         xbar,
                         s/np.sqrt(n))
print("%.2f"%mu_range[0], "< mu < ", "%.2f"%mu_range[1])

58196.92 < mu <  62082.48
